This is a guest post contributed by Oliver Martell, a developer at Altmetric.
Over the last couple of months we’ve been busy shipping a lot of new product features, and one of my main challenges as a new developer on the team has been understanding how all the systems work together to bake all those colourful doughnuts. So today I’ll try to give you a simplified idea of what happens behind the scenes at Altmetric.
The core goal of our product is to track the online attention around scholarly literature. We want to help publishers, institutions and researchers see how people are discussing their research articles. To achieve this we need to look for mentions of articles across different internet sources, such as Twitter, Facebook and news outlets. So the challenge is: how can we tell whether a tweet, a blog post or a news article mentions a research paper? This is where we take advantage of the hyperlinked nature of the web: we search the content of these digital expressions for links that lead to research articles.
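As a minimal sketch of that idea, you could pull the hyperlinks out of a post's text with a regular expression and inspect where they point. The pattern, helper name and example tweet below are purely illustrative, not Altmetric's actual code:

```python
import re

# Naive URL pattern for illustration; real-world link extraction is messier
# (shortened URLs, redirects, trailing punctuation, and so on).
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_links(text):
    """Return every hyperlink found in a piece of post text."""
    return URL_RE.findall(text)

# A made-up tweet linking to a (hypothetical) journal article page.
tweet = "Great read on CRISPR: https://www.nature.com/articles/d41586-020-00001 #science"
print(extract_links(tweet))
```

Once the links are extracted, each one can be checked to see whether it leads to a research article.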
Collecting data from internet sources, figuring out whether a post mentions an academic paper and building the Altmetric score are just some of the tasks involved in building an article’s details page. All this work is spread across different systems that collaborate in different ways to produce the final result you see on the screen.
One of these systems, Weyland, maintains a curated list of the journal domains and news outlets that we’re tracking. This curated list contains domains like nature.com, acm.org, bmj.com and bbc.co.uk. Our collector systems then use these domains to pick out the posts that are worth exploring in more detail.
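The filtering step can be sketched as a host lookup against the curated set. The small set below stands in for the list Weyland maintains, and `is_tracked` is a hypothetical helper, not the real interface:

```python
from urllib.parse import urlparse

# Illustrative stand-in for the curated list maintained by Weyland.
TRACKED_DOMAINS = {"nature.com", "acm.org", "bmj.com", "bbc.co.uk"}

def is_tracked(url):
    """True if the URL's host is a tracked domain or a subdomain of one."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in TRACKED_DOMAINS)

print(is_tracked("https://www.nature.com/articles/xyz"))  # True
print(is_tracked("https://example.com/some-post"))        # False
```

Matching on subdomains as well as the bare domain means links to www.nature.com or blogs.bmj.com still count.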
The next piece of the puzzle is the collectors. These are little programs that talk to the APIs of different internet data sources to fetch the actual tweets, comments, news stories and blog posts. All of these types of social digital expression are transformed into a uniform internal representation that we call a post. Each author’s profile is also stored in our profiles collection, so that we can later use it to influence the Altmetric score and show it on our details pages.
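One way to picture that normalization step is a small uniform record that every source-specific payload gets mapped onto. The field names, the `Post` class and the raw tweet shape below are assumptions for illustration, not Altmetric's real schema:

```python
from dataclasses import dataclass

@dataclass
class Post:
    """Uniform internal representation of any social digital expression."""
    source: str     # e.g. "twitter", "facebook", "news"
    author_id: str  # key into the profiles collection
    text: str
    url: str

def from_tweet(tweet):
    """Map a raw tweet payload (shape assumed for illustration) onto a Post."""
    screen_name = tweet["user"]["screen_name"]
    return Post(
        source="twitter",
        author_id=screen_name,
        text=tweet["text"],
        url=f"https://twitter.com/{screen_name}/status/{tweet['id']}",
    )

raw = {"id": 1, "text": "Nice paper!", "user": {"screen_name": "alice"}}
post = from_tweet(raw)
print(post.source, post.author_id)
```

Each source would get its own mapper like `from_tweet`, so everything downstream of the collectors only ever sees one shape of data.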
These posts and profiles are then processed by our mighty pipeline, which is our main system. All the collectors push posts to it; the pipeline takes each post and fetches the web pages it links to. After getting the content of a page, the system searches it for things that look like references to articles – if it finds one, we’ve found a mention. The article is then stored in our database so that we can use it when more mentions of the same article arrive.
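The "things that look like references to articles" step can be sketched with a scan for DOI-style identifiers, the most common form of scholarly article ID. Whether the real pipeline matches DOIs this way is an assumption, and the pattern below is deliberately simplified:

```python
import re

# Simplified DOI matcher for illustration: "10." followed by a registrant
# prefix and a suffix. Real matching handles many more edge cases.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"'<>]+")

def find_article_references(page_html):
    """Return identifier strings in a fetched page that look like article references."""
    return DOI_RE.findall(page_html)

# A hypothetical fragment of a fetched news page.
page = '<p>See <a href="https://doi.org/10.1038/nature12373">the paper</a>.</p>'
print(find_article_references(page))
```

Each identifier found this way ties the post back to a specific article, which is what turns a plain web page into a mention.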
At this point we also start building our external representation of the article, which contains all of its mentions, aggregated counts, demographics and geographical information. This representation, along with its score, is updated every time the pipeline receives a new post from the collectors. These external articles are consumed by the web API and requested by the front-end system, which is in charge of serving the HTML pages you see in our explorer and details pages.
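A toy version of that incrementally-updated representation might look like the class below. It is a sketch under stated assumptions: the real representation also carries the score, demographics and geography, and the class and field names here are invented for illustration:

```python
from collections import Counter

class ExternalArticle:
    """Aggregated, incrementally-updated view of one article's attention."""

    def __init__(self, article_id):
        self.article_id = article_id
        self.mentions = []        # every post that mentioned the article
        self.counts = Counter()   # aggregated counts per source type

    def add_mention(self, post):
        """Fold one new post from the collectors into the aggregate."""
        self.mentions.append(post)
        self.counts[post["source"]] += 1

article = ExternalArticle("10.1038/nature12373")
article.add_mention({"source": "twitter", "text": "Nice paper!"})
article.add_mention({"source": "news", "text": "Coverage of the study..."})
print(article.counts)
```

Because the aggregate is rebuilt a post at a time, the web API can always serve the latest counts without re-scanning every mention.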
If this sounds like something you would like to help us build, join us!