“On weekdays we typically process around 400,000 stories … 1% will be about research”
We’ve been looking into ways to improve our news collection and analysis algorithms recently and realized that there aren’t a lot of good baseline statistics on how scholarly research is picked up and reported on by the media.
We’ve all probably got intuitive answers to questions like “is academic research written about more than football?” or “which outlets have the most original research reporting, and which just publish press releases?” but it’d be nice to have some data: so here is some. I’d love to work with somebody more familiar with how science journalism works to dig a bit deeper, and equally if you’re interested in taking the raw data and doing your own research, just get in touch!
Let’s start with some definitions:
- “scholarly research” in this case is an output – a book, article, dataset, clinical trial record etc. – that has a recognized identifier like a DOI, arXiv ID, ISBN or handle associated with it.
- The news stories we’ll look at are all text and online; they could be from a newspaper, magazine or online news site but we’re not considering video or audio stories, or stories that appear only in print.
- We’ll only consider a news story to be about an output if you can easily identify the output in question. It needs to contain either a link to a website where the output is found, an identifier (like an ISBN or DOI) or at least one author name and the journal name.
… these obviously won’t capture all types of reporting. If Alice Smith cures cancer and a journalist from The Guardian covers it but doesn’t mention any of her papers, then we’d ignore it. This sucks, but we need to draw a line between stories about research and stories that happen to mention some aspect of research somehow.
It’s also worth noting that it probably results in us undercounting research in the arts & humanities, except where that research ends up in an output like a monograph or journal article.
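Those three tests (identifier, link, or author-plus-journal) can be sketched in code. This is purely illustrative: the function name, the regexes and the domain list are my assumptions here, not our actual production pipeline.

```python
import re

# Rough DOI and ISBN patterns -- good enough for a sketch, not exhaustive
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
ISBN_RE = re.compile(r'\b(?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dXx]\b')

def story_mentions_output(text, links, known_journals, known_authors):
    """True if the story identifiably refers to a scholarly output,
    per the three rules above."""
    # 1. An identifier (DOI or ISBN) somewhere in the text
    if DOI_RE.search(text) or ISBN_RE.search(text):
        return True
    # 2. A link to a publisher or repository site (illustrative domains only)
    scholarly_domains = ("doi.org", "arxiv.org", "ncbi.nlm.nih.gov", "nature.com")
    if any(d in link for link in links for d in scholarly_domains):
        return True
    # 3. Both an author name and a journal name in the text
    lowered = text.lower()
    has_journal = any(j.lower() in lowered for j in known_journals)
    has_author = any(a.lower() in lowered for a in known_authors)
    return has_journal and has_author
```

Note that rule 3 is deliberately conservative: a journal name alone (or an author name alone) isn’t enough to tie the story to a specific output.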
On weekdays we typically process around 400,000 stories from 2,300 different outlets (we manually curate these: we license the same feeds as many media monitoring companies but found blogs and journal RSS feeds in the sources lists, which we handle separately). On weekends this drops to around 190,000 stories.
“70 – 80% will have a link of some sort”
Typically on weekdays 1% of stories – 4,000 – will be about research (remember the definitions above).
- ~ 2 – 3% will have ISBNs that we recognize as academic texts – these are likely to be book reviews.
- ~ 10 – 15% will have a DOI – these are likely to be press releases with the DOI listed at the bottom.
- ~ 70 – 80% will have a link of some sort to a research output or – often – the homepage of the journal where the research has been published.
- ~ 20% – 30% will have an author name and journal that we can then use to identify the output with a high degree of precision.
Note that those percentages don’t add up to 100%: that’s because a single story could have both a link and the author and journal names (or a DOI or ISBN).
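A toy example makes the overlap concrete. The flags below are invented for illustration (not real data): one story can carry a link, a DOI and an author-plus-journal mention all at once, so per-bucket percentages legitimately sum past 100%.

```python
# Four made-up stories, flagged by which identification buckets they hit
stories = [
    {"isbn": False, "doi": True,  "link": True,  "author_journal": False},
    {"isbn": False, "doi": False, "link": True,  "author_journal": True},
    {"isbn": False, "doi": False, "link": True,  "author_journal": False},
    {"isbn": True,  "doi": False, "link": False, "author_journal": True},
]

def bucket_percentages(stories):
    """Percentage of stories hitting each bucket; buckets can overlap."""
    n = len(stories)
    keys = ["isbn", "doi", "link", "author_journal"]
    return {k: 100.0 * sum(s[k] for s in stories) / n for k in keys}

pcts = bucket_percentages(stories)
# isbn 25%, doi 25%, link 75%, author_journal 50% -- totalling 175%, not 100%
```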
Text mining using journal names is difficult. If you look for journal names in news stories then you’ll get far too many results back: 30,000 out of the 400,000 daily articles. This is because for every Proceedings of the National Academy of Sciences of the United States of America (nice and unique) there’s a Brain, a Circulation and a Medicine (not so much). Not to mention Nature and Science.
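One way to tame the ambiguous titles is to accept a match on “Brain” or “Nature” only when a publishing cue appears nearby. A minimal sketch, where the cue list and the ambiguity set are both illustrative assumptions rather than what we actually run:

```python
# Journal titles that double as everyday English words need extra evidence
AMBIGUOUS = {"Brain", "Circulation", "Medicine", "Nature", "Science"}
CUES = ("journal", "published in", "study in", "paper in", "researchers")

def journal_match(text, journal):
    if journal not in text:          # exact-case match keeps "brain" (the organ) out
        return False
    if journal not in AMBIGUOUS:     # long, distinctive titles stand on their own
        return True
    lowered = text.lower()           # ambiguous titles need a nearby publishing cue
    return any(cue in lowered for cue in CUES)
```

Even with cues you’d still want a disambiguation step downstream (author names, DOIs), but this kind of filter cuts the false-positive rate a lot.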
You might wonder why we bother with text mining if up to 80% of news stories have a link. Unfortunately not all news outlets are created equal, as we’ll see later on in this post. Republished press releases are much more likely to contain a link than original journalism but authors and editors are more interested in the latter.
“Is the number of news stories that include a DOI increasing over time? No.”
Is the number of news stories that include a DOI increasing over time? No. I wish it was, but this time last year the percentage was pretty much the same.
In fact the proportion of stories about research in general has remained fairly static – between 1% and 1.2% – since last year.
What about the quality of reporting? This is too big a question to try and tackle in a blog post, but there are some simple indicators we can look at to get an idea of how much editorial or journalistic effort is involved in the research reporting at different outlets.
The majority of news stories on the web that cover research are either syndicated or a press release. Of 2,539,706 recent news stories we looked at, 1,214,132 (47%) were “duplicates” – the same title and first couple of sentences appeared in at least one other outlet. That doesn’t cover outlets that edit the title or lede of a story but nothing else, so the real figure is almost certainly higher. This is to be expected – think about all of the online news sites that exist for specific verticals or regions. There are lots of them and they don’t create all their own content.
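The duplicate check described above – same title plus same first couple of sentences – amounts to fingerprinting each story and counting collisions. Here’s a minimal sketch; the function names and the naive sentence splitting are my own simplifications:

```python
import hashlib

def story_fingerprint(title, body, n_sentences=2):
    """Hash of the normalized title plus the first couple of sentences."""
    lede = " ".join(body.split(". ")[:n_sentences])  # naive sentence split
    key = (title.strip().lower() + "|" + lede.strip().lower()).encode("utf-8")
    return hashlib.sha1(key).hexdigest()

def count_duplicates(stories):
    """Count every story whose fingerprint appears more than once --
    i.e. stories that also ran, essentially unchanged, in another outlet."""
    seen = {}
    for title, body in stories:
        fp = story_fingerprint(title, body)
        seen[fp] = seen.get(fp, 0) + 1
    return sum(c for c in seen.values() if c > 1)
```

Because only the title and lede are hashed, two stories that diverge later in the body still count as duplicates – which matches how syndicated copy usually looks.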
Here’s a treemap of all the outlets we track (you may need to click on it to see the individual cells). The larger the cell, the more research stories the outlet has published. The colour represents how many of the stories from that outlet are duplicates. Red means lots of duplicates, green means hardly any duplicates:
You can see the sources that contribute the most articles to our database: EurekAlert! is the largest (this may not come as a surprise to anybody working in science: it’s an aggregator and distributor of press releases from journals and institutions). In the middle of the treemap you can see a line of red squares with labels like “WFMJ”, “WTOL” and “WRCB-TV” – these are local TV station affiliates that have online news sites. They’re dark red because all of their content is duplicated on the other affiliates in their network.
I was happy to see The Conversation is the second biggest source of research news that we track. They’re a free, professionally edited news site that solicits contributions from working academics: so all of their stories are written by an expert in that field. As they’re authored by academics we track them as outputs as well as news stories – you can read a bit more about that here.
The distribution of research stories by outlet is a long tail – around 100 outlets (out of the 2,300 we’re tracking) are responsible for 50% of the stories about research found. You can see this in the treemap – it’s ordered by size from left to right, moving in columns downwards. You can see that the left hand side is made up of big, prolific outlets and the right hand side many more smaller outlets.
“Around 100 outlets are responsible for 50% of the stories about research found”
You can pick out some well known outlets from the visualization – the Guardian and the Independent from the UK, CNN and The Atlantic from the US. They’re all green, meaning their content is mostly original. Could we distinguish between “high quality” outlets that produce original content and more general press release aggregators, or outlets that rely on syndicated content?
We ask customers to help “tier” news sources, so we can determine which ones are the most important to them. You would think that the importance of a news source is super subjective: French researchers care more about Le Monde than The Los Angeles Times, and some scientists will be more thrilled by being mentioned in Scientific American than in BMX World. Surprisingly, in general there’s a lot of agreement – though this could say more about our customer base than anything else.
Here’s a map of “top tier” news sources. Again, green means fewer duplicates and red means more:
It’s pretty green, especially compared to the next tier down:
So – not really a surprise – it appears that original content is one of the hallmarks of sources that people feel are important.
It’s hard to compare the proportion of research stories to that of stories about other topics, like politics or sport, but we can look at what fields of research get talked about the most in the news. With some helpful technology from our sister company UberResearch we can automatically assign Fields of Research (FoR) codes to research outputs, so we can see if cancer, cosmology or archaeology get mentioned more – in a future post I’ll show those results!