15 types of data to collect when assessing your digital library

September 8, 2016

Stacy Konkiel

What’s there to measure in digital libraries?

For digital special collections, librarians often track metrics for both digitized objects and descriptive information related to those objects, including:

Compressed and uncompressed versions of images, videos, and audio files;
Text files (usually in XML, PDF, and HTML formats);
Descriptive information about the object, captured in a variety of metadata standards; and
Contextual information about an object or a collection described in accompanying essays, timelines, and visualizations.

Put simply, it will surprise no one who has worked with digital collections that digital library content is heterogeneous and, in many cases, complicated to measure.

Measuring more and better: suggested data to collect

The list below is not comprehensive, but instead reflects metrics that are commonly used and easily tracked, and which can be used to understand attention paid to digital library content by various audiences.

Quantitative metrics

Quantitative metrics are arguably easier to capture than other types of data. They do not measure exactly how digital library items are used, but instead they are indicative of overall volume of interest in digital library content.

Page views are commonly tracked in web analytics software, and are often called “impressions”. Along with downloads, they can convey that items are being successfully exposed across the broader Web. Accurately tracking downloads can be problematic for digital special collections that are stored in digital object repositories and are referenced via persistent URLs (PURLs).
Visits, especially returning visits, can provide some indication of engagement.
Referring sites straddle the quantitative and qualitative realms in that a subset of referring sites serves as some indication of scholarly (e.g., Wikipedia and Google Scholar) or popular relevance (e.g., news outlets and Reddit) that could be further traced for citations and references in context.
Shares on social media signal the circulation of content across a potentially vast network of people.
Saves, Favorites, and Bookmarks can capture interest in a given item, and in some cases the intent to use items in future teaching.
Adaptations relate to the creation of derivative works: new research or creative works based on existing digital library images, data, research, software, etc. Accurately tracking adaptation is difficult, even in systems that provide mechanisms for doing so (e.g., forking in GitHub). For digital library content that requires end users to manually offer attribution (e.g., citing a digital collection in a book chapter), the problem can be more acute.
Requests for high-resolution digital library content, submitted via automated means, could be an indicator of later citations, reuse, or adaptations. Further study is needed.
Citations help us understand the use of our digital libraries in a scholarly context, particularly when cited in books and journal articles. Citations to digital library content can be difficult to uncover, however.
Visitor demographic information is another metric of interest to libraries. Demographic information like age and user interests can be sourced from third-party services like Facebook or Google (which are sometimes used to allow visitors to login to library websites), from IP addresses that help determine users’ location, or even from library-administered surveys. There are obvious privacy implications to tracking visitors’ demographic information.
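Of the metrics above, referring sites lend themselves to a quick illustration. The following is a minimal sketch, assuming you already have a list of referring URLs exported from your analytics tool. The domain lists and the function names (classify_referrer, summarize_referrers) are hypothetical and would need to be maintained by hand for your own collections.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, hand-maintained domain lists; extend them for your own collections.
SCHOLARLY = {"wikipedia.org", "scholar.google.com"}
POPULAR = {"reddit.com", "nytimes.com"}


def classify_referrer(url):
    """Return 'scholarly', 'popular', or 'other' for a referring URL."""
    host = (urlparse(url).hostname or "").lower()
    for label, domains in (("scholarly", SCHOLARLY), ("popular", POPULAR)):
        # Match the domain itself or any subdomain (en.wikipedia.org, www.reddit.com).
        if any(host == d or host.endswith("." + d) for d in domains):
            return label
    return "other"


def summarize_referrers(urls):
    """Count how many referring URLs fall into each category."""
    return Counter(classify_referrer(u) for u in urls)
```

The counts are only a starting point: the referrals flagged as scholarly or popular are the ones worth following up by hand to see how the item was cited or discussed.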

Qualitative data

Using qualitative data for assessment sometimes requires manual collection and review, or personal engagement with a digital library’s users. It is invaluable in its ability to convey intent, especially when used alongside quantitative data.

Mentions can be as informal as a “shout-out” or as formal as a citation, though in either case the mention may not be constructed in easily traceable ways (such as by citing a canonical URL or persistent identifier). In venues like Twitter and Wikipedia, where mentions are more easily tracked and aggregated, this data can be harvested to better understand context: what is being said about a particular item? And who is involved in the discussion? Mentions can appear on the Web in many forms: course syllabi, blog posts, policy documents, and news articles (just to name a few).
Reviews or comments provide another avenue for determining value. The volume of comments often does not matter as much as the nature of the comments. In addition, a commenter’s identity can sometimes be equally important when analyzing comment content.
Reference inquiries often provide a story of scholarly use and engagement beyond web analytics. They also create opportunities to follow up with users to learn more about their research interests with respect to the digital library resources. Reference inquiries are often collected and recorded on an ad hoc basis, since they’re often submitted via email, telephone, or in person.

Timing is everything

For institutional use, it can be useful to collect and analyze metrics at times that coincide with both annual, library-wide internal reviews and external reporting events (such as ARL reports, LibQual statistics, and so on). That way, you can reuse the metrics collected for both purposes.

For end user-facing metrics, the delivery of stats should be immediate. For example, “This item has been downloaded x times” is a metric that’s more useful when reported in real time. If manual intervention is required to prepare metrics (such as to parse server logs for relevant information), metrics should be regularly delivered (e.g., weekly or monthly) and transparently reported, so users can understand what they are looking at and can evaluate the usefulness of those metrics accordingly.

Suggested assessment data sources

Following are recommended tools for getting started with data collection. A holistic evaluation framework that librarians might also find useful is JISC’s Toolkit for the Impact of Digitised Scholarly Resources (TIDSR).

Web server logs

Metrics reported: downloads, referring sites, visits, pageviews, limited demographic information.
Be sure to export information at regular intervals, as consistent collection is important for longitudinal analysis.
Web server log data often adhere to certain formats (Apache Custom Log Format, W3C Extended File Log Format, etc.) and can be processed and visualized for human consumption with the help of tools like Webalizer, AWStats, and Tableau.
Tableau is especially useful for analyzing, grouping, and visualizing web server log data: it can be used to build dashboards, assess user populations, and chart usage over time.
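If you want to see what is in the logs before reaching for one of those tools, a short script can go a long way. The following is a minimal sketch, assuming your server writes the Apache "combined" log format. The function name summarize_log is hypothetical, and treating every successful GET request as a page view is a simplification (it does not filter out bots or requests for images and stylesheets).

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Apache "combined" format:
# host ident user [time] "request" status size "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_log(lines):
    """Tally successful GET requests per path and per referring host."""
    views = Counter()
    referrers = Counter()
    for line in lines:
        m = LOG_PATTERN.match(line)
        # Skip unparseable lines, non-GET requests, and anything but 200 OK.
        if m is None or m["method"] != "GET" or m["status"] != "200":
            continue
        views[m["path"]] += 1
        # A referrer of "-" means none was sent; urlparse yields no hostname.
        ref_host = urlparse(m["referrer"]).hostname
        if ref_host:
            referrers[ref_host] += 1
    return views, referrers
```

To use it, pass an open log file (or any iterable of lines) and inspect the two counters, for example with views.most_common(10). Running the same script on each regular export keeps the numbers comparable over time.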

Google Analytics

Metrics reported: downloads, referring sites, visits, pageviews, limited demographic information.
Google Analytics has some dashboard functionality that’s useful for saving elements you want to review regularly. GA is also useful for longitudinal analysis, showing big trends in traffic and use.
For digital collections: Szajewski (2013) has written an excellent guide to using the tool to measure the value of digital special collections.

Citation databases

Metrics reported: peer-reviewed journals and books that cite digital library resources.
Citations can be sourced from subscription databases like Scopus and Web of Science, or from free platforms like Google Scholar and Google Books. Often, specialized searches are required to find citations to digital library content: “cited reference search” can be used in Web of Science, and free-text search for digital library or collections’ names can be employed in other databases to find references to digital library content.
Citations are much easier to track when persistent identifiers like DOIs are used by whoever is citing digital library content.
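To illustrate why persistent identifiers make tracking easier, here is a minimal sketch that pulls DOIs and Handles out of free text, such as a reference list copied from an article. The patterns are common heuristics rather than complete validators, and the function name extract_identifiers is hypothetical.

```python
import re

# Heuristic patterns, not full validators: a DOI starts with "10.", a
# registrant code of 4 to 9 digits, and a slash; a Handle is whatever
# follows the hdl.handle.net proxy address.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
HANDLE_PATTERN = re.compile(r'hdl\.handle\.net/([^\s"<>]+)')


def extract_identifiers(text):
    """Return the unique DOIs and Handles found in a block of text."""
    # Trim punctuation that usually belongs to the sentence, not the identifier.
    dois = {d.rstrip(".,;)") for d in DOI_PATTERN.findall(text)}
    handles = {h.rstrip(".,;)") for h in HANDLE_PATTERN.findall(text)}
    return {"dois": sorted(dois), "handles": sorted(handles)}
```

Once identifiers are extracted, they can be compared against the list of DOIs or Handles minted by your repository. Citations that use only a collection's name still need the free-text searches described above.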

Altmetric Explorer for Institutions

Metrics reported: Shares, saves/favorites, adaptations, mentions, and some other quantitative and qualitative data sourced from the social web. For a full list of Altmetric’s data sources, check out our website.
Altmetric collects data from across the web related to any scholarly outputs, including any content in a subscriber’s digital special collection (no persistent identifier necessary). This data can be displayed in embeddable badges on item records.
Altmetric provides important qualitative data behind the numbers we report. For example, in addition to seeing that items in your digital library have been mentioned 17 times on Wikipedia, you can also see exactly what has been written about them.

Altmetrics data via social media APIs

It is also technically possible for digital libraries to connect with individual social media platforms’ APIs to search for mentions of their content. In theory, one could monitor social media sites for mentions of relevant URLs.
The main drawback to this option is the developer time required to build customized solutions for each digital library. It could result in much duplicated effort.
Another possible drawback is the limitations placed on search APIs by platforms themselves; for example, researchers have pointed out that Twitter’s search API is typically restricted to fetching data from only the previous week, and the API’s rate limits restrict the retrieval of large amounts of data at once.
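Working within rate limits usually means fetching results a page at a time and pausing between requests. The following is a minimal sketch of that pattern only. It does not call any real platform API: fetch_page is a hypothetical function you would write for a given platform, and the interval and page cap are placeholder values, not any platform's actual limits.

```python
import time


def collect_mentions(fetch_page, min_interval=2.0, max_pages=10, sleep=time.sleep):
    """Collect items from a paged search while pausing between requests.

    fetch_page is a hypothetical callable: it takes a cursor (None for the
    first page) and returns (items, next_cursor), where next_cursor is None
    once there are no more pages.
    """
    items, cursor = [], None
    for page in range(max_pages):
        if page > 0:
            # Wait before every request after the first to respect rate limits.
            sleep(min_interval)
        batch, cursor = fetch_page(cursor)
        items.extend(batch)
        if cursor is None:
            break
    return items
```

Because search results may only reach back a short time, a script like this would need to run on a schedule (and store what it finds) to build up a longer record.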

Qualitative data via Google Alerts and Mention

Track when your content has been shared on the web by setting a Google Alert or Mention alert for your:
- digital library’s name,
- digital library’s base URL (http://collection1.libraries.psu.edu/cdm/singleitem/collection/amc/id/314/),
- repository’s Handle or DOI shoulder (https://scholarworks.iu.edu/dspace/handle/2022/9564 and http://hdl.handle.net/2022/9564; http://dx.doi.org/10.5061/dryad.pp67h and http://datadryad.org/resource/doi:10.5061/dryad.pp67h), or
- special URLs created for collections (http://webapp1.dlib.indiana.edu/vwwp/welcome.do).
For important collections, you might also want to set alerts for titles or names relevant to those collections (e.g., for Penn State’s “Advertising Trade Cards from the Alice Marshall Women’s History Collection,” they might also set alerts for “Advertising Trade Cards” and “Alice Marshall”).
Google Alerts is free to use; Mention is a subscription service.

In Summary

Most digital libraries currently use relatively basic assessment strategies, often ones that are thin on evidence for how collections are being used. Adding altmetrics, citation data, contextual information, and other data to assessment practices could vastly increase libraries’ understanding of how their digital collections are being used and by whom. A number of free and subscription tools can make it easy to automate the collection of data, and analyzing the resulting data at regular intervals can keep assessment projects manageable.