I have several PDF documents that are almost exclusively transcripts. I'm looking for an automated way to search through these transcripts and scrape the conversations, headers, etc. into raw data, so that I can answer questions like "How many times did X say Y?"
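To make the goal concrete, here is a minimal sketch of the kind of count I'm after, run on text that has already been extracted. It assumes each turn starts with a capitalized speaker label followed by a colon (e.g. `ALICE: ...`) and that unlabeled lines continue the previous speaker's turn. That layout, the `SPEAKER_LINE` pattern, and the function name `count_phrase_by_speaker` are my own assumptions for illustration; real transcripts may well be laid out differently.

```python
import re
from collections import Counter

# Assumed line format: "SPEAKER: what they said". This is a guess at the
# transcript layout; adjust the pattern to match the real documents.
SPEAKER_LINE = re.compile(r"^\s*([A-Z][\w .'-]*?):\s*(.*)$")


def count_phrase_by_speaker(text, phrase):
    """Count case-insensitive occurrences of `phrase` for each speaker."""
    counts = Counter()
    speaker = None
    needle = re.compile(re.escape(phrase), re.IGNORECASE)
    for line in text.splitlines():
        match = SPEAKER_LINE.match(line)
        if match:
            # A new turn begins: remember who is speaking.
            speaker = match.group(1).strip()
            utterance = match.group(2)
        else:
            # No speaker label: treat the line as a continuation of the turn.
            utterance = line
        hits = len(needle.findall(utterance))
        if speaker is not None and hits:
            counts[speaker] += hits
    return counts
```

With this, `count_phrase_by_speaker(text, "I agree")["ALICE"]` would answer "How many times did ALICE say 'I agree'?". The weak point is the speaker detection, which is exactly where the original formatting (e.g. bolded names) would help.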
Is there a way to convert a PDF to a friendlier format (say, HTML or pseudo-HTML) where I can see exactly how the text is structured and formatted?
I'm currently using a scraper that converts all of the included text into a .txt file. That is useful, but it throws out the formatting (bolded statements, etc.), and keeping that formatting would make life a lot easier.
A way to inspect PDFs like this from Python would be appreciated as well.
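For reference, this is roughly the output I have in mind, sketched under the assumption that a library such as PyMuPDF (imported as `fitz`) is an acceptable dependency. As far as I understand, its `page.get_text("dict")` returns nested blocks, lines, and spans, where each span carries its text, a font name, and a `flags` integer whose bit 4 marks bold. The helper names `page_dict_to_html` and `pdf_to_html` are hypothetical, and the font-name check is only a heuristic for PDFs that signal bold through the font rather than the flag.

```python
import html

# Bit 4 of a span's "flags" value marks bold text in PyMuPDF's dict output.
BOLD_FLAG = 1 << 4


def page_dict_to_html(page_dict):
    """Turn one page's get_text("dict") result into pseudo-HTML with <b> tags."""
    lines_out = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            parts = []
            for span in line.get("spans", []):
                text = html.escape(span.get("text", ""))
                # Heuristic: trust the flag, or a font name containing "bold".
                is_bold = bool(span.get("flags", 0) & BOLD_FLAG) or (
                    "bold" in span.get("font", "").lower()
                )
                if is_bold and text.strip():
                    parts.append(f"<b>{text}</b>")
                else:
                    parts.append(text)
            lines_out.append("".join(parts))
    return "<br>\n".join(lines_out)


def pdf_to_html(path):
    """Convert every page of the PDF at `path`; pages are separated by <hr>."""
    import fitz  # PyMuPDF, third-party (assumed installed: pip install pymupdf)

    doc = fitz.open(path)
    try:
        return "\n<hr>\n".join(
            page_dict_to_html(page.get_text("dict")) for page in doc
        )
    finally:
        doc.close()
```

If the speaker names in my transcripts are bolded, output like `<b>ALICE:</b> ...` would make splitting turns by speaker much more reliable than guessing from plain text.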