2

I am extracting text from PDF files using the Python pdfminer library (see docs).

However, for some files pdfminer does not extract all of the text and returns an LTFigure object instead. Judging by the position of this object, it "covers" some of the text, which is why that text is not extracted.

Both the PDF file and a short Jupyter notebook with the extraction code are in a GitHub repository I created specifically for this question:

https://github.com/druskacik/ltfigure-pdfminer

I am not an expert on how PDF files work, but common sense tells me that if I can find the text using Ctrl+F in a browser, it should be extractable.

I have considered using another library, but I also need the positions of the extracted words (to use them in my machine learning model), and pdfminer seems to be the only library that provides this.

druskacik

2 Answers

1

OK, so I finally came up with a solution. It's very simple: you can iterate over an LTFigure object the same way you would iterate over, e.g., an LTTextBox object.

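from pdfminer.layout import LTChar, LTFigure, LTTextBox, LTTextLine

# interpreter, device and page come from the usual
# PDFPageInterpreter / PDFPageAggregator setup.
interpreter.process_page(page)
layout = device.get_result()

for lobj in layout:
    if isinstance(lobj, LTTextBox):
        # Regular text: a text box contains text lines.
        for element in lobj:
            if isinstance(element, LTTextLine):
                text = element.get_text()
                print(text)

    elif isinstance(lobj, LTFigure):
        # Text inside a figure: individual characters, not grouped into lines.
        for element in lobj:
            if isinstance(element, LTChar):
                text = element.get_text()
                print(text)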

Note that the correct way (to make sure the parser reads everything in the document) is to iterate over the pdfminer objects recursively, as shown here: How does one obtain the location of text in a PDF with PDFMiner?
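
For illustration, here is a minimal sketch of such a recursive walk. It is not the code from the linked answer. The name `iter_text_leaves` is made up, and the function is duck-typed on purpose: it assumes only that pdfminer containers (pages, text boxes, text lines, figures) are iterable and that LTChar exposes `get_text()` and a `bbox` attribute. Anything without a bounding box is skipped.

```python
def iter_text_leaves(obj):
    """Yield (text, bbox) for every non-container object that has both.

    Hypothetical helper: containers are detected by being iterable,
    text leaves by having get_text() and bbox (as LTChar does).
    """
    try:
        children = iter(obj)
    except TypeError:
        children = None  # not a container

    if children is not None:
        # Container (e.g. page, text box, text line, figure): descend.
        for child in children:
            yield from iter_text_leaves(child)
    elif hasattr(obj, "get_text") and hasattr(obj, "bbox"):
        # Leaf with text and a position, e.g. a single character.
        yield obj.get_text(), obj.bbox
    # Everything else (images, lines, rectangles, ...) is ignored.
```

With the `layout` object from the snippet above, `for text, bbox in iter_text_leaves(layout): ...` should then visit the characters nested inside figures as well, at any depth. The result is character by character; it does not group characters into words or lines.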

druskacik
  • May I ask what your "figure" is in reference to in your 'for element in figure:' loop please? I myself am trying to get some use out of LTFigure but not having much luck.... – PW1990 Jan 18 '22 at 05:41
  • 1
    Sorry, it should be `for element in lobj:`, of course. Fixed it. – druskacik Jan 18 '22 at 10:25
  • Now if only pdfminer would do the word and line grouping on Figure contents. I've got a figure that covers the entire page with 1200 LTChars inside of it, so I'll be re-creating pdfminer's word and line grouping logic to sort it out – user15741 Aug 16 '22 at 18:54
0

Given that you are also considering other libraries, I suggest using pdftohtml from poppler-utils to convert the PDF to XML:

!apt-get install -y poppler-utils
!pdftohtml -c -hidden -xml document.pdf output.xml

It outputs an XML file with the text and the top, left, width, and height values of each box. It had no issues with the text that pdfminer doesn't recognize.
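
To get those positions back into Python, the XML can be read with the standard library. This is only a sketch: `read_text_boxes` is a made-up name, and it assumes the layout I have seen pdftohtml produce, i.e. `<page number="...">` elements containing `<text top="..." left="..." width="..." height="...">` elements, possibly with inline `<b>`/`<i>` tags inside the text. Check your own output.xml before relying on it.

```python
import xml.etree.ElementTree as ET


def read_text_boxes(xml_source):
    """Return one dict per <text> element: page number, box and text.

    xml_source is a file name or a binary file object.
    Assumes <page number=...> elements holding <text top/left/width/height>.
    """
    root = ET.parse(xml_source).getroot()
    boxes = []
    for page in root.iter("page"):
        for node in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "top": float(node.get("top")),
                "left": float(node.get("left")),
                "width": float(node.get("width")),
                "height": float(node.get("height")),
                # itertext() also collects text nested in <b>, <i>, ...
                "text": "".join(node.itertext()),
            })
    return boxes
```

Calling `read_text_boxes("output.xml")` then gives a list you can feed into pandas or directly into your model's preprocessing.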

RJ Adriaansen