We added a cool new feature to the Explorer last week, though it’s hidden away a little: the ability to export the actual mentions of a set of outputs as CSV (comma separated values, suitable for loading in Excel and other spreadsheet applications).
Let’s use it to pull out some insights!
You’ve always been able to download and analyze counts and metadata about research outputs, like journal articles, datasets, book chapters and about journals. To get the actual news headlines, blog names and so on, though, you’d previously need to have used our API…. but while we’ve tried to make the API as straightforward to use as possible not everybody is a developer and sometimes it’s just easier to fire up a spreadsheet.
The new feature isn’t designed to help you analyze massive datasets: that’s still what the API is best at, and too many rows will crash Excel anyway. You can download a maximum of one million mentions at a time. Need more than that? Try using a more precise search, or filtering out the types of mention that you don’t need.
Let’s jump right into some examples of what you can do with the new data and some simple Excel-fu.
First off, create a search like normal. For our examples, we’re going to search for the keyword “schizophrenia”. This returns 24,019 outputs and 128,935 mentions (at the time of writing – it’ll be different for you if you’re following along! From this point on I’ll stick to approximate numbers). Let’s download them! Click on “Analyze These Results” on the right-hand sidebar, then choose the “Mentions” tab from the Results Analysis pop-up. You should see something like this:
Click on the “Export mentions data…” link, and around 40MB of CSV goodness will start downloading.
Now let’s play with the data! I’m not going to try and write any Excel tutorials here, but the critical thing to Google, if you aren’t already familiar with them, is “pivot tables”. These are a feature of Excel that let you filter, count and sort data grouped in different ways. The other thing you should know about is clicking on a column header to get summary stats in the bottom of the window, as shown below:
Load up your downloaded CSV file in Excel. Now select columns A through G inclusive (so everything from “Mention Type” to “Journal/Collection Title”. Create a new PivotTable by going to Insert -> PivotTable on the ribbon. By default it’ll create this table in a new worksheet: that’s fine.
Set up the Pivot Table fields as shown below. What we’re doing is grouping by the author of each mention, and giving ourselves the ability to drill down into specific mention types (tweets, news stories, blogs…) and journals.
Let’s try to get a feel for the Twitter users who tweet about schizophrenia research. First, let’s get a baseline. Choose “tweet” from the mention type filter, and get summary stats on the “Count of Outlet or Author” column. We can see that there are approximately 102,000 tweets in the dataset, from around 28,000 tweeters.
That implies an average of 3.5 tweets per account. Is that true though? Let’s check the distribution. Sort the count of author column from largest to smallest. Scrolling down, it looks like there’s a long tail: lots of accounts have only ever tweeted once, and a few have tweeted thousands of times. Let’s do some simple checks… first of all, what %age of tweeters are responsible for 50% of all tweets?
If we start at the top and select cells as we go down we can stop when the summary stats shows a sum of 51,000 (around half of tweet dataset). In my file, this is when the count of authors is 514, or 0.5% of the total number of authors. So 0.5% of tweeters are responsible for 50% of tweets in this set.
Surprised? You shouldn’t be, this kind of skew is fairly normal for Twitter. Partly it’s because there’s a mix of audiences on the site for research content, but mostly it’s because of bots. Let’s look at the ten most prolific tweeters in our set… they’re the first rows of the pivot table since we sorted by Count of Outlet or Author:
The first four are automated accounts and they alone are responsible for 17% of all the tweets in our set. After that, you start real people or organizations tweeting in a more organic way (@PaulWhiteleyPhD and @Keith_Laws are researchers – Keith Laws is an editor at BMC Psychology and PLoS One).
Let’s ignore the top four automated accounts to get rid of the skew they’re causing. Now we’re left with ~ 84k tweets, and ~ 4% of tweeters are responsible for 50% of them. We can use this as our baseline.
Does this proportion vary by journal? Let’s look at some and compare, using the Journal/Collection title filter and always ignoring the four prolific automated accounts above.
|Journal||Tweeters||50% of tweets from||Tweeted 3x from here|
|Schizophrenia Research||~ 3,760||23.0%||12.7%|
|British Journal of Psychiatry||~ 1,030||17.3%||17.3%|
|PLoS One||~ 894||28.5%||8.8%|
How to interpret these numbers? Well, with caution. A few assumptions are fairly safe to make though: more people tweet about schizophrenia articles in Schizophrenia Research than in PLoS One. The number of people who tweeted 3+ times from a single journal might indicate loyalty… but only if you adjust for the number of articles about schizophrenia published in each one. For each of the journals, there’s a “power user” segment making up around a quarter of the total, who are responsible for half of all the tweets about schizophrenia research.
Of course, it’s not all about metrics, qualitatively there are interesting things to note too. For example, none of the journals – despite all having Twitter accounts – are in their own ten most frequent tweeters lists, so it’s not the case that they’re tweeting all of the articles that they publish. In fact, Nature Reviews Neurology (@NatRevNeurol) has tweeted more schizophrenia research in PLoS One than the official PLoS One account has (but less than @PLOSNeuro, PLoS’ neuroscience community account).
There’s a lot of overlap in the most frequent tweeters lists, but in each case, there are two or three “top” tweeters that aren’t bots and aren’t in the top ten of the other journals. For PLoS One it’s @SebValenz (“Geek Scienceloving Psychologist from PUC SCL”), for the BJP it’s @SameiHuda (who also blogs at The Mental Elf) and for Schizophrenia Research it’s @morriseric (director of the La Trobe University Psychology Clinic, in Melbourne).
That’s it for now – next time we’ll use the same file to figure out time distributions: when do different types of mention appear after publication?
For more information on our new export mentions feature, why not register for our webinar on the 27th July: ‘Exporting mentions to uncover new insights: unpacking our latest feature’