I am trying to extract information from some tables in a PDF document.
Consider the input:

Title 1
some text some text some text some text some text
some text some text some text some text some text

Table Title
| Col1          | Col2    | Col3    |
|---------------|---------|---------|
| val11         | val12   | val13   |
| val21         | val22   | val23   |
| val31         | val32   | val33   |

Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text

I can get the outlines/titles as such:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization (empty here).
document = PDFDocument(parser, '')
# Each outline entry is a (level, title, dest, action, structure element) tuple.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print (level, title)

This gives me:

(1, u'Title 1')
(2, u'Table Title')
(1, u'Title 2')

Which is perfect, as the levels are aligned with the text hierarchy. Now I can extract the text as follows:

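from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object that aggregates each page into a layout tree.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
text_from_pdf = open('textFromPdf.txt', 'w')
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    layout = device.get_result()
    for element in layout:
        if isinstance(element, LTTextBox):
            # Replace non-ASCII characters with spaces before writing.
            text_from_pdf.write(''.join([i if ord(i) < 128 else ' '
                                            for i in element.get_text()]))
text_from_pdf.close()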

Which gives me:

Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Table Title
Col1
val11
val21
val31
Col2
val12
val22
val32
Col3
val13
val23
val33
Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text

This is a bit odd, because the table is extracted column by column. Is it possible to get the table row by row? Moreover, how can I identify where a table begins and ends?

martineau
AbtPst
  • If you can extract the table column-by-column and store it in a 2D `list` (list-of-lists), then you should be able to transpose it to get a row-by-row format. This is often done with the built-in [`zip()`](https://docs.python.org/3/library/functions.html#zip) function. As for finding the end of the table, you'll need to see if you can detect some sort of change in formatting. – martineau Sep 14 '17 at 15:37
  • Thanks, but the problem is that I don't know where the table begins. Any title in my document could indicate a table. How do I know? – AbtPst Sep 14 '17 at 19:12
  • There may be a pattern to how tables are constructed if there's only one source of the PDF documents. If you can figure that out, your code can watch for it. Unfortunately, I don't think PDF files have any kind of formal "table" element, so doing something like that may be your only recourse... – martineau Sep 14 '17 at 20:32
  • Thanks, makes sense. I'll have to devise a strategy depending on my data. – AbtPst Sep 15 '17 at 03:30
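
A minimal sketch of the transpose idea from the comments, assuming the column-wise values have already been collected into a list of lists. The `columns` data below is made up to match the example table in the question; how you split the extracted text into columns depends on your documents.

```python
# Hypothetical column-wise data, one inner list per table column.
columns = [
    ["Col1", "val11", "val21", "val31"],
    ["Col2", "val12", "val22", "val32"],
    ["Col3", "val13", "val23", "val33"],
]

# zip(*columns) pairs up the i-th item of every column, i.e. it yields rows.
rows = [list(row) for row in zip(*columns)]

for row in rows:
    print(row)
```

Note that `zip()` stops at the shortest column, so a column with a missing cell would silently truncate every row after it.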

2 Answers

If you only want to extract tables from PDF documents, then look at this answer: How to extract table as text from the PDF using Python?

From that answer, I tried tabula-py, which worked for me with tables of figures spread over a multi-page PDF. tabula-py correctly skipped all the headers and footers. Previously I had tried PDFMiner on this same type of document, and I had the same problem you mentioned, and sometimes even worse.
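
A rough sketch of what the tabula-py route might look like, assuming the package is installed (`pip install tabula-py`) along with the Java runtime it depends on. The function name `extract_tables_with_tabula` is just an illustrative helper, and the file name is the one from the question.

```python
def extract_tables_with_tabula(pdf_path):
    """Return a list of pandas DataFrames, one per table tabula detects."""
    # Imported lazily so this module still loads without tabula-py installed.
    import tabula

    # pages="all" scans every page; multiple_tables=True asks for a list
    # of DataFrames instead of a single merged one.
    return tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)


# Example usage (needs a real PDF on disk):
# for table in extract_tables_with_tabula("myFile.pdf"):
#     print(table)
```

Each DataFrame is already row-oriented, so no transposing is needed. Whether the table boundaries are detected correctly will depend on the layout of your PDFs.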

Vincent Agami

Use camelot to extract tables from PDFs.
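
A minimal sketch of how that could look, assuming Camelot is installed (`pip install camelot-py`) and that the tables have ruled cell borders, which is what the `lattice` flavor expects. The helper name `extract_tables_with_camelot` is hypothetical, and the file name is taken from the question.

```python
def extract_tables_with_camelot(pdf_path, flavor="lattice"):
    """Return one pandas DataFrame per table Camelot finds in the PDF."""
    # Imported lazily so this module still loads without camelot installed.
    import camelot

    # flavor="lattice" relies on drawn cell borders; flavor="stream"
    # guesses columns from whitespace and may suit borderless tables.
    tables = camelot.read_pdf(pdf_path, pages="all", flavor=flavor)

    # Each detected table exposes its contents as a DataFrame via .df
    return [table.df for table in tables]


# Example usage (needs a real PDF on disk):
# for df in extract_tables_with_camelot("myFile.pdf"):
#     print(df)
```

If the default flavor finds nothing, trying `flavor="stream"` is a reasonable next step, though the results depend heavily on how the PDF was produced.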