We are pleased to be publishing a series of blogs authored by scientometrics researcher Stefanie Haustein over the coming weeks. This is the first post of a five-post series.
This first post in our mini series analyzes the What of scholarly Twitter data and thus focuses on what kind of content gets tweeted. We will explore if people link to scholarly papers when they tweet about research and will identify which document types, scholarly disciplines and journals receive the most attention on Twitter. Even though it would be highly informative towards our understanding of users’ motivations to tweet about scientific publications, content analysis is beyond the scope of this post. Instead, this week’s blog post will focus on the quantitative analysis of document metadata such as journal, discipline, publication year and document type, rather than on the tweet itself.
Links and persistent identifiers are essential to tracking scholarly Twitter activity
Altmetrics researchers have a koan that goes something like this: “If a paper is tweeted about but it’s not linked to, does it have altmetrics?” The koan exists because few people realize that services like Altmetric usually require two things before they can pick up a mention:
- That online discussions about a scholarly document link to it; and
- That the linked webpage includes a persistent identifier such as a DOI or PubMed ID
The first requirement—linking to research outputs—is sensible, but is not as common as one would think. An early study found that as few as 6% of academics’ tweets linked to a scholarly publication. On top of that, half of these tweets referred to the article only indirectly via another website. These indirect links—so-called second-order events—are currently not captured by altmetrics data providers. So, a lot of the discussions happening on Twitter are not being picked up by altmetrics aggregators.
Then there is the need for persistent identifiers for research outputs shared on Twitter. This requirement exists so that altmetrics aggregators can disambiguate scholarly publications from other links. Unique identifiers (e.g., DOI, PubMed ID) tell altmetrics aggregators, “This is a journal article,” making it much easier to identify when relevant content is mentioned online. Although DOIs are the most common persistent identifiers for journal articles and can now also be assigned to preprints and datasets, not all types of research outputs come with a DOI. Since DOI registration costs might prevent journals in less affluent disciplines (e.g., arts and humanities) or developing countries from issuing DOIs, their prevalence is not distributed equally. This means that altmetric studies based on peer-reviewed journal articles with a DOI are biased towards STEM journals and well-off publishers from the Global North. In both the design of altmetrics aggregators and the methods used to study the diffusion of research on Twitter, the ‘DOI bottleneck’ represents one of the main limitations of currently captured altmetrics: based on Priem’s study mentioned above, they represent only a tiny fraction of relevant online discussions about research. For scholarly Twitter metrics to be representative of actual online discussions about research objects, new technologies are needed that reflect how research is actually linked to online, both to track associated Twitter activity and to study existing altmetrics data.
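To make the two requirements concrete, here is a minimal Python sketch of how a tweet might be matched to a publication through a DOI. The regular expression is a rough approximation of modern DOI syntax and is an assumption for illustration only; it is not the matching logic of Altmetric or any other aggregator. The function name and the sample tweets are hypothetical.

```python
import re

# Approximate pattern for modern DOIs: prefix "10.", a 4-9 digit registrant
# code, a slash, then a suffix. Illustrative assumption, not an aggregator rule.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

def find_dois(tweet_text):
    """Return the DOIs that appear literally in a tweet's text or links."""
    dois = []
    for match in DOI_PATTERN.findall(tweet_text):
        # Strip trailing punctuation that belongs to the sentence, not the DOI.
        dois.append(match.rstrip(".,;:)").lower())
    return dois

# Hypothetical tweets: one links via a DOI, one discusses a paper without a link.
linked = "Great read: https://doi.org/10.1234/example.5678."
unlinked = "Loved the new paper on altmetrics in that journal!"

print(find_dois(linked))    # ['10.1234/example.5678']
print(find_dois(unlinked))  # []
```

The second tweet is the kind of discussion that stays invisible to identifier-based tracking: it is about research, but nothing in it can be resolved to a specific document.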
Most tweeted research is very recent
Due to the fast pace of communication on Twitter, it is mostly the most recent publications that gain the Twitterverse’s attention. As we will see in the post on when documents are tweeted, tweets linking to a paper appear shortly after its publication, and the paper often receives no further attention after a few days or weeks. The great majority of tweets captured by Altmetric between 2012 and 2016 therefore mention articles published during the same time span.
Based on the June 2016 Altmetric data dump (analyzed in the handbook chapter), the figure below shows that 54% of tweeted documents were published between 2012 and 2016. While 55% of tweets link to articles from 2013 to 2015, papers published in 2016 alone were the subject of more than one quarter of all tweets captured by Altmetric until mid-2016.
Nevertheless, tweeted articles cover a wide range of publication dates, as shown in the figure above. The pink (and red) bars represent tweeted documents (and tweets) referring to articles published before 1950 or after 2020, respectively. In fact, 19% of documents had no publication date. At the same time, 438 documents had publication dates well into the future relative to the data collection in June 2016. This is a common phenomenon that’s mainly caused by errors in publisher metadata (which is why we need projects like Metadata2020). These inconsistencies can also be found in other document metadata, which are essential for the characterization of scholarly output diffused on Twitter. I have discussed the issues associated with publication dates and the challenges of determining the actual time of publication here and here.
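A simple check of this kind can flag missing and implausible publication dates before any analysis. The sketch below is only an illustration: the exact cut-off date is an assumption (the data dump is described only as dating from June 2016), and the function name and sample records are hypothetical.

```python
from datetime import date

# Assumed cut-off; the data dump is only described as "June 2016".
COLLECTION_DATE = date(2016, 6, 30)

def classify_pub_date(pub_date):
    """Label a publication date as 'missing', 'future' or 'plausible'."""
    if pub_date is None:
        return "missing"
    if pub_date > COLLECTION_DATE:
        return "future"
    return "plausible"

# Hypothetical records: a normal date, a missing date, a date in the future.
records = [date(2015, 3, 1), None, date(2031, 1, 1)]
labels = [classify_pub_date(d) for d in records]
print(labels)  # ['plausible', 'missing', 'future']
```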
Tweets link to both traditional journals and emerging publication venues
The majority of frequently tweeted sources are peer-reviewed journals indexed in the Web of Science. Repositories for articles (arXiv, SSRN, bioRxiv) or data (figshare, Dryad) and websites such as The Conversation or ClinicalTrials.gov are also linked regularly on Twitter. In fact, the largest number of tweeted documents are arXiv submissions: in the dataset we studied, a total of 319,411 arXiv eprints were tweeted 1.1 million times by 110,134 users. As shown in the scatterplot, the number of tweeted documents and the number of unique users diverge, particularly for the most popular sources. Although arXiv, PLOS ONE and SSRN are the most popular platforms according to the number of tweeted documents, Nature, The Conversation and PLOS ONE are tweeted by the largest number of users, suggesting a far reach on social media for articles published in these venues.
It should be mentioned that the document metadata in Altmetric is based on a variety of sources and thus shows some inconsistencies. For example, the source is unknown for 6% of tweeted documents and 4% of tweets in the Altmetric data dump and some sources appear in various spellings. Therefore, the number of distinct sources (49,379) or journal IDs (28,457) overestimates unique sources.
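To see why spelling variants inflate the count of distinct sources, consider the following Python sketch. The normalization rules and the sample source names are assumptions chosen for illustration; they are not the cleaning procedure used for the Altmetric data.

```python
import re

def normalize_source(name):
    """Collapse case, punctuation and whitespace so spelling variants match.

    Limitation of this sketch: non-ASCII letters are dropped as punctuation.
    """
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]+", " ", name)  # replace punctuation with spaces
    return " ".join(name.split())             # squeeze repeated whitespace

# Hypothetical source strings as they might appear in raw metadata.
sources = [
    "PLOS ONE", "PLoS One", "plos one ",
    "Nature",
    "Science & Education", "Science and Education",
]

distinct_raw = len(set(sources))
distinct_clean = len({normalize_source(s) for s in sources})
print(distinct_raw, distinct_clean)  # 6 3
```

Six raw strings collapse to three sources here. Real data would need more careful handling (abbreviations, translated titles, journal renamings), so a rule-based pass like this one is a starting point rather than a full solution.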
Academic journals are a major player in Twitter traction
Since there is no gatekeeping on Twitter as there is in academic publishing, tweeting activity can be heavily influenced or even gamed by individual users. For scholarly journals, tweeting links to recent publications has developed into an important marketing strategy for publishing houses and other stakeholders. Many journals now have their own Twitter account to promote articles (e.g., @JASIST) or are at least represented through their publisher’s Twitter presence.
But journals aren’t the only Twitter users diffusing academic research. With more than 10,000 unique users each sharing their content, PLOS ONE, BMJ, Nature, Science and PNAS were tweeted by the largest audiences on Twitter. These journals are large multidisciplinary and biomedical serials with wide readership. However, the number of unique users, in particular for the general science journals, seems to be low in comparison to their global readership. For example, while Nature.com claimed to have 3 million unique visitors per month, its 2015 articles were tweeted by only 42,365 users. That number is, however, comparable to Nature’s print circulation. These statistics suggest that far more people are reading research than tweeting about it.
Biomedicine is the most tweeted discipline
Differences between scientific disciplines have been well documented as an important characteristic of tweeting behavior. While on average a bit more than one third of all Web of Science articles published in 2015 were mentioned on Twitter, large variations can be observed between scholarly disciplines: 59% of publications in Biomedical Research, Health and Psychology were tweeted, while in Mathematics and Engineering & Technology Twitter coverage was below 10%. At the level of scientific specialties, the percentage of tweeted documents was highest in Parasitology (78%), Allergy (76%) and Tropical Medicine (70%). Articles from General & Internal Medicine (13.5) and Miscellaneous Clinical Medicine (12.3) received the highest average number of tweets per document. Many altmetric studies have shown that multidisciplinary and biomedical journals as well as social science publications are particularly visible on Twitter, while the so-called hard sciences are tweeted about less.
News, editorial pieces, and other trends in highly-tweeted document types
Marking a clear distinction from citation patterns, document types that are usually considered uncitable in the context of bibliometrics (that is, documents which are usually not cited, such as editorial material, news items and book reviews) show particularly high Twitter activity. For example, Twitter coverage for news items was twice that for all documents, while density, the average number of tweets per document, was highest for news items (3.0), editorial material (1.6) and reviews (1.4) compared to the overall average of 0.8 tweets per document. The success of these document types on Twitter suggests that publications which report on topical subjects and present news and opinions in simpler and less technical language are more likely to receive attention on Twitter. We also showed in another study that publications with shorter titles, fewer pages and fewer references tend to receive more tweets.
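The two indicators used here, coverage (the share of documents tweeted at least once) and density (the average number of tweets per document), are easy to compute from per-document tweet counts. The Python sketch below uses made-up counts for two document types; the function name and the numbers are hypothetical and only serve to illustrate the definitions.

```python
def coverage_and_density(tweet_counts):
    """Return (coverage, density) for a list of per-document tweet counts.

    coverage: share of documents with at least one tweet
    density:  average number of tweets per document (untweeted ones included)
    """
    n = len(tweet_counts)
    if n == 0:
        raise ValueError("need at least one document")
    coverage = sum(1 for c in tweet_counts if c > 0) / n
    density = sum(tweet_counts) / n
    return coverage, density

# Hypothetical tweet counts for two document types.
news_items = [0, 4, 6, 2]
articles = [0, 0, 1, 0, 3]

print(coverage_and_density(news_items))  # (0.75, 3.0)
print(coverage_and_density(articles))    # (0.4, 0.8)
```

Because density averages over all documents, including those never tweeted, a document type can have high density either through broad coverage or through a few heavily tweeted items. Reporting both indicators side by side keeps the two effects apart.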
So what do these trends mean for the interpretation of tweeted research?
When interpreting Twitter metrics for research, it’s important to keep in mind that differences in document types, publication dates and disciplines, and whether or not a journal has a Twitter account, can have a major effect on how frequently Twitter users will share a publication. Although we know that between one quarter and one third of scholarly publications are shared on Twitter, attention is not divided equally. Recent papers and preprints from health and social sciences as well as so-called ‘uncitable’ document types are tweeted more frequently than older publications and those from the hard sciences. With regard to tweets as scholarly metrics, this means that Twitter activity should not be compared across publication years, disciplines and document types without appropriate normalization. Missing or inconsistent article metadata make it necessary to clean data or to add information from third parties who offer more standardized metadata.
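One way to picture such a normalization is to divide each document’s tweet count by the average of its reference group, in the spirit of mean-normalized citation indicators from bibliometrics. The Python sketch below illustrates that general idea under stated assumptions; it is not a method prescribed in this post, and the field names and sample records are hypothetical.

```python
from collections import defaultdict

def normalized_tweet_scores(docs):
    """Divide each document's tweets by the mean of its reference group.

    A reference group is every document sharing the same discipline,
    publication year and document type.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of tweets, n docs]
    for d in docs:
        key = (d["discipline"], d["year"], d["doc_type"])
        totals[key][0] += d["tweets"]
        totals[key][1] += 1

    scores = {}
    for d in docs:
        total, n = totals[(d["discipline"], d["year"], d["doc_type"])]
        mean = total / n
        # A group with no tweets at all has no meaningful baseline.
        scores[d["id"]] = d["tweets"] / mean if mean > 0 else None
    return scores

# Hypothetical records: same year and document type, different disciplines.
docs = [
    {"id": "bio-1", "discipline": "Biomedicine", "year": 2015, "doc_type": "article", "tweets": 12},
    {"id": "bio-2", "discipline": "Biomedicine", "year": 2015, "doc_type": "article", "tweets": 4},
    {"id": "math-1", "discipline": "Mathematics", "year": 2015, "doc_type": "article", "tweets": 3},
    {"id": "math-2", "discipline": "Mathematics", "year": 2015, "doc_type": "article", "tweets": 1},
]

print(normalized_tweet_scores(docs))
# {'bio-1': 1.5, 'bio-2': 0.5, 'math-1': 1.5, 'math-2': 0.5}
```

In raw counts the biomedical paper with 12 tweets looks four times as visible as the mathematics paper with 3, yet both sit at 1.5 times their group average. Since tweet counts are highly skewed, a mean-based baseline is only one possible choice; percentile-based approaches are another.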
When analyzing what type of scholarly publications get shared on Twitter, one also needs to consider that publishers now systematically diffuse their own content on Twitter, representing a kind of ‘self-tweet’, which artificially increases tweet rates and Altmetric scores. Similar promotional efforts might happen on the author level, which might reinforce geographic biases towards papers authored by researchers in the US and UK. This is all to say that tweets linking to scholarly publications represent quite heterogeneous activities, ranging from completely automated diffusion, through intense discussions among scholars, to engagement by non-academic Twitter users. My colleagues Rodrigo, Tim and I have tried to provide a theoretical framework for these acts and different levels of engagement.
Make sure to come back for next week’s post, which will explore where users sharing research articles come from and investigate whether Twitter metrics help to democratize research evaluation (spoiler alert: they don’t!).
 Altmetric has reduced the number of documents without publication dates to 2% in June 2018.
 In 2016 Altmetric collected metadata from a) publishers or platform webpages, b) publisher-specific APIs, c) CrossRef and other metadata stores and d) institutional repositories. Since early 2017 institutional data is no longer queried, as it was the source of many errors.
Stefanie Haustein is assistant professor at the University of Ottawa’s School of Information Studies, where she teaches research methods and evaluation, social network analysis and knowledge organization. Her research focuses on scholarly communication, bibliometrics, altmetrics and open science. Stefanie co-directs, together with Juan Pablo Alperin, the #ScholCommLab, a research group that analyzes all aspects of scholarly communication in the digital age. Stefanie’s publications can be found on her website. She tweets as @stefhaustein.