Unlocking Data in PDFs
our workflow for scraping data
Unfortunately, a lot of data is released on the web in the form of PDF files. Scraping data out of PDFs is much harder than scraping from a web page; web pages have structure, in the form of HTML, that you can usually leverage to extract structured data.
It isn’t hopeless; it’s just harder. Here are some of the tools and techniques that we’ve found useful in parsing data from PDFs.
pdftotext is a utility from the Xpdf project that converts PDFs to flat text files. It is easiest to install and use on Unix-based platforms, where it can be found in the poppler-utils package. There is also a Windows port of Xpdf that I’ve used successfully.
pdftotext has several useful flags that affect how it parses its input.
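As a rough illustration of driving pdftotext from a script, here is a minimal Python sketch. It assumes pdftotext is installed and on your PATH, and it uses three flags: -layout (keep the physical layout of the page), -f (first page to convert), and -l (last page to convert). The function names, the file names, and the "two or more spaces" column rule are assumptions made for this example, not part of pdftotext itself.

```python
import re
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path, layout=True,
                            first_page=None, last_page=None):
    """Build the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")            # preserve the page's physical layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert
    cmd += [pdf_path, txt_path]
    return cmd


def pdf_to_text(pdf_path, txt_path, **options):
    """Run pdftotext; raises if the binary is missing or the call fails."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, **options),
                   check=True)


def split_columns(line):
    """Split one line of layout-preserved text on runs of 2+ spaces.

    This is a heuristic assumed for the example: it works only when the
    columns in the text output are separated by at least two spaces.
    """
    return re.split(r"\s{2,}", line.strip())
```

For example, pdf_to_text("report.pdf", "report.txt", first_page=2, last_page=5) would convert pages 2 through 5 of a hypothetical report.pdf. Each line of the resulting text file can then be passed to split_columns to recover table cells, although real PDFs often need per-document tweaks to that rule.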
Recently I’ve switched to Tabula for most of my PDF scraping needs. Tabula is a desktop application for extracting data from PDFs. I’ve found it to be more reliable than pdftotext. The only drawback is that it isn’t a command-line program, so automating the scraping isn’t as easy as it is with pdftotext. On the other hand, you can visually select the parts of the PDFs you’d like to scrape, which is useful for one-off jobs.