I have been endlessly searching for a tool that can extract text from a PDF while maintaining structure. That is, given a text like this:
Title
Subtitle1
Body1
Subtitle2
Body2
OR
Title
Subtitle1. Body1
Subtitle2. Body2
I want a tool that can output a list of titles, subtitles and bodies. Alternatively, if anybody knows how to approach this, that would also be useful :)
This would be easier if these three categories always had the same formatting, but the subtitles can be bold, italic, underlined, or any combination of the three. The same goes for the titles. The problem with simply parsing HTML/PDF/DOCX is that these documents follow no standard, so quite often a single sentence is split across several tags (in the case of HTML), which makes it really hard to parse. As you can see, the subtitles are not always on their own line above a given paragraph, and they are sometimes in bullet points. So many possible combinations of formatting...
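To make the problem concrete, here is a minimal sketch of the kind of heuristic I have in mind for the HTML case, using only the standard library's html.parser. It glues together text that is split across several tags, and treats the leading bold/italic/underlined run of each block as a candidate heading, so it covers both layouts above. The names BlockCollector and label_blocks and the two tag sets are my own assumptions, and it ignores emphasis applied through inline CSS (e.g. font-weight), so it is only a starting point and not a solution.

```python
from html.parser import HTMLParser

# Assumed tag sets; real documents may need more (or CSS-based checks).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "h4", "h5", "h6"}
EMPHASIS_TAGS = {"b", "strong", "i", "em", "u"}


class BlockCollector(HTMLParser):
    """Collect ("heading" | "body", text) pairs from block-level elements."""

    def __init__(self):
        super().__init__()
        self.blocks = []          # finished (label, text) pairs
        self._runs = []           # (text, emphasized) runs of the current block
        self._emphasis_depth = 0  # number of emphasis tags currently open

    def _flush(self):
        # Leading emphasized runs form the heading; the rest is the body.
        lead, rest = [], []
        for text, emphasized in self._runs:
            if not rest and (emphasized or not text.strip()):
                lead.append(text)
            else:
                rest.append(text)
        # Join the runs first, then normalize whitespace, so that words
        # split across tags ("<b>Sub</b><b>title</b>") are glued back.
        heading = " ".join("".join(lead).split())
        body = " ".join("".join(rest).split())
        if heading:
            self.blocks.append(("heading", heading))
        if body:
            self.blocks.append(("body", body))
        self._runs = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS or tag == "br":
            self._flush()
        elif tag in EMPHASIS_TAGS:
            self._emphasis_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag in EMPHASIS_TAGS and self._emphasis_depth > 0:
            self._emphasis_depth -= 1

    def handle_data(self, data):
        self._runs.append((data, self._emphasis_depth > 0))

    def close(self):
        super().close()
        self._flush()


def label_blocks(html):
    """Return a list of (label, text) pairs for the given HTML string."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

For example, label_blocks("<p><u>Subtitle2.</u> Body2</p>") gives a heading "Subtitle2." followed by a body "Body2". The obvious weakness is that any paragraph that merely starts with an emphasized word is also reported as a heading, and it cannot tell titles from subtitles, which is exactly why I am asking for something more robust.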
So far I have come across similar questions, here using Tesseract and here using OpenCV, yet none of them quite answers my question.
I know that there are some machine learning tools to extract "Table of Contents" sections from scientific papers, but that does not cut it either. Does anyone know of a package/library that does this, or whether such a thing has been implemented yet? Or does anyone know an approach to solve this problem, preferably in Python?
Thank you!
Edit:
The documents I am referring to are 10-Ks from companies, such as this one: https://www.sec.gov/Archives/edgar/data/789019/000119312516662209/d187868d10k.htm#tx187868_10 Say I want to extract Item 7 in a programmatic and structured way, as I mentioned above. But not all of these filings are standardized enough for plain HTML parsing. (The PDF document is just this HTML saved as a PDF.)