PubMed is a popular search engine for biomedical literature. It has lost a lot of ground to Google Scholar over the past few years, but for a long time it was the go-to scientific search engine for psychologists, neuroscientists, and the like. And the cool thing is that, unlike Google Scholar, PubMed allows you to write scripts to automatically download enormous amounts of information.
Which is what I did over the weekend: I downloaded information about scientific articles. Names, authors, abstracts (summaries), journal titles, etc. And lots of it. I figured I would eventually get banned for abusing the PubMed service, but I didn't, and the end result is a database containing 257,535 articles published between 1950 and 2010 in 43 academic journals, broadly focused on neuroscience and cognitive psychology [1]. To the extent that PubMed has a complete index, this should include a large proportion of all articles published between those years in those journals.
So that's a lot of data!
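For the curious, here is a minimal sketch of what a download script of this kind could look like. It assumes NCBI's public E-utilities web interface (the `esearch` endpoint) and nothing beyond Python's standard library. The function names and the example journal are made up for illustration, and the query format is just one plausible way to ask for a journal and a range of publication years.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of NCBI's E-utilities, the scripting interface to PubMed.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_search_url(journal, first_year, last_year, retmax=100):
    """Build an esearch URL for one journal and a publication-year range."""
    # [Journal] and [dp] (date of publication) are PubMed search field tags.
    term = f'"{journal}"[Journal] AND {first_year}:{last_year}[dp]'
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def parse_search_result(payload):
    """Return (total number of hits, list of PubMed IDs) from esearch JSON."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), result["idlist"]


def search(journal, first_year, last_year, retmax=100):
    """Run the search over the network (hypothetical helper, not called here)."""
    url = build_search_url(journal, first_year, last_year, retmax)
    with urlopen(url) as response:
        return parse_search_result(response.read().decode("utf-8"))


# Example usage (requires network access):
#   total, ids = search("J Neurosci", 1950, 2010)
```

The IDs that come back can then be passed to the `efetch` endpoint to retrieve the actual records. NCBI asks scripts to stay under three requests per second unless they use an API key, so a real script would want to pause between calls.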
I'm planning to write a series of blog posts, each time focusing on a different aspect of this data set. My main aim will be to understand the whole system of academic publishing just a little bit better, and to see how it has evolved over the years. All the while keeping in mind, of course, that even this quarter of a million articles reflects just a tiny fraction of the total volume of scientific output. And a biased fraction at that, because the journals have been hand-picked …