Silent Spring Institute


Developer Blog

Unlocking Data in PDFs

our workflow for scraping data

Unfortunately, a lot of data is released on the web in the form of PDF files. Scraping data out of PDFs is much harder than scraping from a web page: web pages have structure, in the form of HTML, that you can usually leverage to extract structured data.

It isn’t hopeless; it’s just harder. Here are some of the tools and techniques that we’ve found useful in parsing data from PDFs.

pdftotext

pdftotext is a utility from the Xpdf project that converts PDFs to flat text files. It is easiest to install and use on Unix-based platforms, where it can be found in the poppler-utils package. There is also a Windows port of Xpdf that I’ve used successfully.
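Here is a minimal sketch of calling pdftotext from a Python script, assuming the binary is on your PATH. The function names and the `report.pdf` file are made up for illustration. Passing `-` as the output file asks pdftotext to write to standard output, and `-enc UTF-8` pins the output encoding so the result can be decoded predictably.

```python
import shutil
import subprocess


def basic_command(pdf_path, txt_path="-"):
    """Build the argument list for a plain conversion.

    "-" as the output file means "write the text to stdout".
    """
    return ["pdftotext", "-enc", "UTF-8", pdf_path, txt_path]


def pdf_to_text(pdf_path):
    """Return the text of a PDF as a string (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError(
            "pdftotext not found; install poppler-utils or Xpdf"
        )
    result = subprocess.run(
        basic_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # report.pdf is a placeholder file name, not a file from this post.
    print(pdf_to_text("report.pdf"))
```

Keeping the command in its own function makes it easy to inspect or log exactly what will be run before touching any files.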

pdftotext has several useful flags that affect how it parses its input.
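As a rough sketch, the Python below builds a pdftotext command line from a few options documented in its man page: `-layout` (keep the physical page layout), `-raw` (keep content-stream order), `-f` and `-l` (first and last page), and `-nopgbrk` (no page-break characters). These may not be the exact flags we rely on most, and the helper names are hypothetical. The `split_columns` helper assumes that `-layout` output separates table columns with runs of two or more spaces, which is common but not guaranteed.

```python
import re


def pdftotext_command(pdf_path, txt_path, mode=None,
                      first_page=None, last_page=None,
                      no_page_breaks=False):
    """Build a pdftotext argument list (hypothetical helper).

    mode: None, "layout" (adds -layout) or "raw" (adds -raw).
    """
    cmd = ["pdftotext"]
    if mode in ("layout", "raw"):
        cmd.append("-" + mode)
    elif mode is not None:
        raise ValueError("mode must be None, 'layout' or 'raw'")
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    if no_page_breaks:
        cmd.append("-nopgbrk")
    return cmd + [pdf_path, txt_path]


def split_columns(line):
    """Split one line of -layout output on runs of 2+ whitespace characters."""
    return re.split(r"\s{2,}", line.strip())


if __name__ == "__main__":
    # Pages 3-5 of a placeholder file, keeping the table layout.
    print(pdftotext_command("report.pdf", "report.txt",
                            mode="layout", first_page=3, last_page=5))
    print(split_columns("Site A    12.5   ug/L"))
```

Splitting on two or more spaces keeps single-spaced values such as "Site A" together, but it will merge columns that the PDF sets only one space apart, so spot-check the output against the original pages.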

tabula

Recently I’ve switched to Tabula for most of my PDF scraping needs. Tabula is a desktop application for extracting data from PDFs. I’ve found it to be more reliable than pdftotext. The only drawback is that it isn’t a command-line program, so automating the scraping isn’t as easy as it is with pdftotext. On the other hand, you can visually select the parts of the PDFs you’d like to scrape, which is useful for one-off jobs.
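Tabula can export the tables you select as CSV, so the automated part of the job usually starts from that file. The sketch below shows one way to tidy such an export with Python’s standard csv module, assuming the export contains stray whitespace and the occasional empty row. The function name and the `tabula-export.csv` file name are placeholders.

```python
import csv


def clean_rows(csv_file):
    """Read rows from an open CSV file, trimming cells and dropping blank rows."""
    rows = []
    for row in csv.reader(csv_file):
        cells = [cell.strip() for cell in row]
        if any(cells):  # skip rows where every cell is empty
            rows.append(cells)
    return rows


if __name__ == "__main__":
    # tabula-export.csv is a placeholder name for a file saved from Tabula.
    with open("tabula-export.csv", newline="", encoding="utf-8") as f:
        for row in clean_rows(f):
            print(row)
```

Because the function takes an open file object rather than a path, you can also feed it an in-memory string while testing.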