7

I have been endlessly searching for a tool that can extract text from a PDF while maintaining structure. That is, given a text like this:


Title

Subtitle1

Body1

Subtitle2

Body2


OR


Title

Subtitle1. Body1

Subtitle2. Body2


I want a tool that can output a list of titles, subtitles and bodies. Or, if anybody knows how to do this, that would also be useful :)

This would be easier if these three categories always shared the same formatting, but the subtitles can be bold, italic, underlined, or any combination of the three. The same goes for the titles. The problem with simple parsing from HTML/PDF/Docx is that these texts follow no standard, so quite often we encounter sentences split across several tags (in the case of HTML), which makes them really hard to parse. As you can see, the subtitles are not always above a given paragraph, and they are sometimes in bullet points. So many possible combinations of formatting...

So far I have found similar questions here using Tesseract and here using OpenCV, yet neither of them quite answers my question.

I know that there are some machine learning tools to extract "Table of Contents" sections from scientific papers, but that also does not cut it. Does anyone know of a package/library for this, or whether such a thing has been implemented yet? Or does anyone know an approach to solve this problem, preferably in Python?

Thank you!

Edit:

The documents I am referring to are 10-Ks from companies, such as this one: https://www.sec.gov/Archives/edgar/data/789019/000119312516662209/d187868d10k.htm#tx187868_10 Say I want to extract Item 7 in a programmatic and structured way, as I mentioned above. Not all of these documents are standardized enough for HTML parsing. (The PDF document is just this HTML saved as a PDF.)

  • Do you mean pdf text extraction or ocr pdf images? – zindarod Jul 09 '18 at 22:29
  • Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow. One of your "related" questions contained actual code; the other is from years ago when there were not that many tool requests, and to-day it should swiftly be closed as well. – Jongware Jul 10 '18 at 15:18
  • @zindarod I am referring to pdf text extraction - these are pdfs that have been generated from html documents, so they contain text. But treating it as a text extraction problem hasn't worked, hence my search for OCR tools. – Daniel Firebanks-Quevedo Jul 10 '18 at 15:18
  • @usr2564301 I will reframe my question, thank you – Daniel Firebanks-Quevedo Jul 10 '18 at 15:20
  • There are PDF text extraction modules written in Python (e.g., [PyMuPDF](http://pymupdf.readthedocs.io/en/latest/app2/)). But you say the problem is that there's no standard to the titles, sub-titles, and bodies so how do you intend to get this information programmatically? What are the outlines of the algorithm you have in mind? Also, if any of these PDFs are accessible online it may be helpful to link to them. – J. Owens Jul 10 '18 at 22:38
  • @J.Owens Thank you, I just linked an example. The way I intended to get this programmatically is to either detect all the possible different formattings and assume that some titles/subtitles are in different fonts than the bodies, or to have a model be trained on a sample so that it can recognize structure - just general ideas – Daniel Firebanks-Quevedo Jul 11 '18 at 13:22
  • The link you provided is to a HTML file, not a PDF. – mkl Jul 11 '18 at 16:41
  • @mkl The PDF document is just the HTML file saved as a PDF – Daniel Firebanks-Quevedo Jul 11 '18 at 17:47

3 Answers

2

There is a lot of coding to do here, but let me give you a description of what I would do in Python. This is based on there being some structure in terms of font size and style:

  1. Use the Tesseract OCR software (open source, free) through Pytesseract, with OEM 1 and PSM 11
  2. Preprocess your PDF to an image and apply other relevant preprocessing
  3. Get the output as a dataframe and combine individual words into lines of text (group by block_num, par_num and line_num, ordered by word_num)
  4. Compute the thickness of every line of text (by the use of the image and tesseract output)
    • Convert image to grayscale and invert the image colors
    • Perform Zhang-Suen thinning on the selected area of text on the image (opencv contribution: cv2.ximgproc.thinning)
    • Sum where there are white pixels in the thinned image, i.e. where values are equal to 255 (white pixels are letters)
    • Sum where there are white pixels in the inverted image
    • Finally compute the thickness as (sum_inverted_pixels - sum_skeleton_pixels) / sum_skeleton_pixels (this sometimes raises a zero division error, so check whether the sum of the skeleton is 0 and return 0 instead)
    • Normalize the thickness by minimum and maximum values
  5. Get headers by applying a threshold for when a line of text is bold, e.g. 0.6 or 0.7
  6. To distinguish between a title and a subtitle, you have to rely on either enumerated titles and subtitles or the font size of the title and subtitle.
    • Calculate the font size of every word by converting height in pixels to height in points
    • The median font size becomes the local font size for every line of text
  7. Finally, you can categorize titles and subtitles; everything in between can be treated as body text.
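
Steps 4 to 7 can be sketched roughly as follows. This is only a minimal illustration, not a full pipeline: it assumes the pixel sums for each line have already been computed from the inverted and thinned images, and the dict keys, the 300 DPI default and the rule "largest bold line is the title" are my own assumptions for the example.

```python
from statistics import median


def thickness(sum_inverted, sum_skeleton):
    """Stroke thickness from step 4; returns 0 when the skeleton is empty."""
    if sum_skeleton == 0:
        return 0.0
    return (sum_inverted - sum_skeleton) / sum_skeleton


def min_max(values):
    """Scale values to [0, 1]; all zeros if every value is the same."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def px_to_pt(height_px, dpi=300):
    """Pixel height to points (1 pt = 1/72 inch) at the rendering DPI."""
    return height_px * 72.0 / dpi


def classify(lines, bold_threshold=0.6, dpi=300):
    """Label each line 'title', 'subtitle' or 'text'.

    Each line is a dict with hypothetical keys: 'sum_inverted' and
    'sum_skeleton' (pixel sums from step 4) and 'word_heights_px'.
    """
    if not lines:
        return []
    bold = min_max([thickness(l["sum_inverted"], l["sum_skeleton"]) for l in lines])
    # Step 6: the median word size is the font size of the line.
    sizes = [median(px_to_pt(h, dpi) for h in l["word_heights_px"]) for l in lines]
    header_sizes = [s for s, b in zip(sizes, bold) if b >= bold_threshold]
    labels = []
    for size, b in zip(sizes, bold):
        if b < bold_threshold:
            labels.append("text")
        elif size == max(header_sizes):
            # Assumption: the largest bold lines are titles.
            labels.append("title")
        else:
            labels.append("subtitle")
    return labels
```

The threshold and the title rule will need tuning per document set; for enumerated headings a regular expression on the line text may work better than font size.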

Note that there are ways to detect tables, footers, etc., which I will not dive deeper into. Look for research papers like the ones below.

Relevant research papers:

Casper Hansen
1

There are certain tools that can accomplish what you are asking for, up to a certain extent. By "certain extent", I mean that the font properties of headings and titles will be retained after the OCR conversion.

Take a look at Adobe's Document Cloud platform. It has not fully launched yet and is expected to launch in early 2020. However, developers can get access now by signing up for the early access program. All the information is available at the following link:

https://www.adobe.com/devnet-docs/dcsdk/servicessdk/index.html

I have personally tried out the service and the outputs seem promising. All headings and titles are recognised as they appear in the input document. The microservice that offers this exact feature is the "ExportPDF" service, which converts a scanned PDF document to a Microsoft Word document.

Sample code is available at: https://www.adobe.com/devnet-docs/dcsdk/servicessdk/howtos.html#export-a-pdf

Karthick Mohanraj
  • THANK YOU for this! At the moment they offer 1.000 calls for free, and then there is an undisclosed fee to be paid. Anyway, I got stuck at the "Generating personalized code samples" step. This is how I solved it: https://medium.com/@netpalantir/adobe-document-services-stuck-on-creating-credentials-4ef3e1d3a614 – Palantir Feb 24 '21 at 13:19
0

I did some research and experiments on this topic, so let me share a few hints from that work, which is still far from perfect.

I haven't found any reliable library to do it. If I had the time and possibly the skills (I am still relatively inexperienced in reading others' code), I would have liked to check out some of the work out there, one project in particular (parsr).

I did reach some decent results in header/title recognition by applying filters to Tesseract's hOCR output. It requires extensive work, i.e.

  1. OCR the pdf
  2. Properly parse the resulting hOCR, so that you can access its paragraphs, lines and words
  3. Scan each line's height, by splitting their bounding boxes
  4. Scan each word's width and height, again splitting bounding boxes, and keep track of them
  5. Heights are needed to intercept false positives, because line heights are sometimes inflated
  6. Find out the most frequent line height, so that you have a baseline for the general base font
  7. Start by identifying the lines that have height higher than the baseline found in #6
  8. Eliminate false positives by checking whether the max height of the line's words matches the line's own height; otherwise use the max word height of the line to compare against the #6 baseline.
  9. Now you have a few candidates, and you want to check that:
    • The candidate line does not belong to a paragraph whose other lines do not share the same height, unless it's the first line (sometimes Tesseract joins the heading with the paragraph).
    • The line does not end with "." or "," and possibly other markers that rule out a title/heading.
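
To give a rough idea in Python, here is a minimal sketch of steps 2 to 9 using only the standard library. It is not my actual code: it assumes hOCR where lines carry the `ocr_line` class and words the `ocrx_word` class with a `bbox` in the `title` attribute, the `tolerance` parameter is an arbitrary choice, and the paragraph check from step 9 is left out.

```python
import re
from collections import Counter
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLines(HTMLParser):
    """Collect every ocr_line with its box height, word heights and words."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._in_word = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if not match:
            return
        _, y0, _, y1 = map(int, match.groups())
        if "ocr_line" in classes:
            self.lines.append({"height": y1 - y0, "word_heights": [], "words": []})
        elif "ocrx_word" in classes and self.lines:
            self.lines[-1]["word_heights"].append(y1 - y0)
            self._in_word = True

    def handle_data(self, data):
        if self._in_word and data.strip():
            self.lines[-1]["words"].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_word = False


def heading_candidates(hocr, tolerance=2):
    """Return the text of lines that look taller than the base font."""
    parser = HocrLines()
    parser.feed(hocr)
    lines = [l for l in parser.lines if l["words"]]
    if not lines:
        return []
    # Step 6: the most frequent line height is the baseline.
    baseline = Counter(l["height"] for l in lines).most_common(1)[0][0]
    candidates = []
    for line in lines:
        # Step 8: fall back to the tallest word if the line box is inflated.
        tallest = max(line["word_heights"])
        inflated = abs(line["height"] - tallest) > tolerance
        height = tallest if inflated else line["height"]
        text = " ".join(line["words"])
        # Steps 7 and 9b: taller than baseline, no sentence-like ending.
        if height > baseline + tolerance and not text.endswith((".", ",")):
            candidates.append(text)
    return candidates
```

Exact pixel heights vary from line to line in real scans, so in practice you would probably bucket the heights before picking the most frequent one.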

The list runs quite a bit longer. E.g. you might also want to apply other criteria, like comparing the widths of the same word: if a line contains more than a certain share of words (I use >= 50%) that are wider than average, compared to the same word elsewhere in the document, you almost certainly have a good candidate header or title. (Titles and headers typically contain words that also appear elsewhere in the document, often multiple times.)

Another criterion is checking for all-caps lines, and a further reinforcement can be single liners (lines that belong to a paragraph with just one line).
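
The word-width criterion could look something like this in Python. Again this is just a sketch under my own assumptions: the input is a list of lines, each a list of `(word, width_px)` pairs already extracted from the hOCR, words are compared case-insensitively, and the average includes the occurrence being tested.

```python
from collections import defaultdict
from statistics import mean


def wide_word_lines(lines, min_share=0.5):
    """Return indices of lines where at least min_share of the words are
    wider than the document-wide average width of that same word."""
    widths = defaultdict(list)
    for line in lines:
        for word, width in line:
            widths[word.lower()].append(width)
    average = {word: mean(values) for word, values in widths.items()}

    hits = []
    for index, line in enumerate(lines):
        if not line:
            continue
        # A word that occurs only once equals its own average, so it never counts.
        wider = sum(1 for word, width in line if width > average[word.lower()])
        if wider / len(line) >= min_share:
            hits.append(index)
    return hits
```

Lines made up only of words that occur once in the document can never match, so this works as a reinforcement of the height checks rather than as a replacement for them.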

Sorry I can't post any code (*), but hopefully you got the gist.

It's not exactly an easy feat and requires a lot of work if you don't use ML. I'm not sure how much faster ML would make it either, because there's a ton of PDFs out there, and the big guys (Adobe, Google, Abbyy, etc.) have probably trained their models for quite a while.

(*) My code is in JS, and it's deeply intertwined with a large conversion application, which so far I can't release as open source. I am reasonably sure you can do the job in Python, although JS DOM manipulation might be something of an advantage there.

Giampaolo Ferradini