Scraping PDF to something friendlier

Question

I've got several documents in PDF form that are almost exclusively transcripts. I'm looking for an automated way to search through these transcripts and scrape the conversations, headers, etc. into raw data, so that I can answer questions like "How many times did X say Y?"

Is there a way that I can convert PDF to a friendlier format (say, HTML or pseudo-HTML) where I can see exactly what's going on?

I'm currently using a scraper that converts all of the included text into a txt file. That is useful, except that it throws out the formatting (bolded statements, etc.), and keeping that formatting would make life a lot easier.

A way to inspect PDFs like this with Python would be appreciated as well.

I take it these PDFs do not allow you to save them as text first? Some do... — RonaldBarzell, Dec 07 '12 at 23:57
Hmm... I'm not sure what you mean by that. I have PDFs in a folder. Is there a standard functionality to merely save them as text? I should clarify that these PDFs aren't 100% words. There are some pictures and tables, but these are largely (for now) irrelevant for my purposes. — River Tam, Dec 07 '12 at 23:59
Well, when I open some PDFs, I get the option to save them as text. Not all PDFs. I imagine it comes down to how they were generated. — RonaldBarzell, Dec 08 '12 at 00:00
Ah, I just tried it. It seems to work the same way as normally converting it to txt, which isn't terribly helpful. There are similar options which I'm exploring now but which aren't as readily available. — River Tam, Dec 08 '12 at 00:06
Have you tried this: http://www.pdfonline.com/convert-pdf-to-html/ — RonaldBarzell, Dec 08 '12 at 00:11
It's not bad, but it won't batch convert. I have like thousands to look for, and I'm not sure the CSS is very conducive (I have to reference the CSS to tell whether something is bold or something like that) — River Tam, Dec 08 '12 at 00:20
Well, there's got to be PDF parsers for Python. Of course you'd still have to generate the HTML markup, but if it's fairly regular markup, maybe....? — RonaldBarzell, Dec 08 '12 at 00:22
Have you looked at this? http://stackoverflow.com/questions/276434/converting-pdf-to-html-with-python — RonaldBarzell, Dec 08 '12 at 00:23

score 1 · Accepted Answer · answered Feb 08 '14 at 03:46

You can have a look at our open source library PDF2JSON. It converts all text data to JSON or XML so that you can inspect it more easily:

http://code.google.com/p/pdf2json
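
Once the text is extracted with its font information intact, the "How many times did X say Y?" part is plain Python. Below is a minimal sketch, not PDF2JSON's actual output format: it assumes a hypothetical list of (text, font_name) spans, assumes bold fonts have "Bold" in their name, and assumes each bold span is a speaker label. All three assumptions would need checking against your real transcripts.

```python
import html
import re
from collections import Counter

# Hypothetical span shape: (text, font_name). This is NOT PDF2JSON's real
# schema; adapt the unpacking to whatever your extractor actually emits.


def is_bold(font_name):
    # Heuristic assumption: bold faces usually carry "Bold" in the font name.
    return "bold" in font_name.lower()


def to_pseudo_html(spans):
    """Wrap bold spans in <b> tags so the formatting survives as text."""
    parts = []
    for text, font in spans:
        escaped = html.escape(text)
        parts.append(f"<b>{escaped}</b>" if is_bold(font) else escaped)
    return "".join(parts)


def count_phrase_by_speaker(spans, phrase):
    """Count case-insensitive occurrences of `phrase` per speaker.

    Assumes each bold span is a speaker label and that the non-bold
    spans following it belong to that speaker.
    """
    counts = Counter()
    speaker = None
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    for text, font in spans:
        if is_bold(font):
            speaker = text.strip().rstrip(":")
        elif speaker is not None:
            counts[speaker] += len(pattern.findall(text))
    return counts


# Made-up sample data, for illustration only.
spans = [
    ("ALICE:", "Helvetica-Bold"),
    (" I object. I really object.\n", "Helvetica"),
    ("BOB:", "Helvetica-Bold"),
    (" Noted.\n", "Helvetica"),
]

counts = count_phrase_by_speaker(spans, "object")
pseudo_html = to_pseudo_html(spans)
```

With thousands of files, the same two functions could be run in a loop over the per-file output of whichever converter you settle on.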

FlowPaper Team
