Extract text and tables of a PDF file in Python

Question

I am looking for a way to extract both text and tables from a PDF file. Some packages extract text well, but they are not good enough at extracting tables.

One option is the Azure Form Recognizer Layout model, but it fails on pages that mix text and tables, in particular when a table is laid out like running text: the output then mixes the table contents with the surrounding text (see the Azure Form Recognizer code at https://github.com/Azure-Samples/cognitive-services-quickstart-code/blob/master/python/FormRecognizer/rest/python-train-extract.md).

I also tried PyPDF2 and pdfplumber; here is the code for PyPDF2:

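import os

import PyPDF2

data_path = "directory/to/pdf/files"
texts = []

for fp in os.listdir(data_path):
    # Open each PDF in binary mode; the with-block closes it afterwards.
    with open(os.path.join(data_path, fp), "rb") as pdf_file:
        print(pdf_file)
        # PyPDF2 >= 2.0 names; older releases used PdfFileReader,
        # numPages, getPage(i) and extractText() instead.
        reader = PyPDF2.PdfReader(pdf_file)
        text = " "
        # Concatenate the text of every page in the file.
        for page in reader.pages:
            text += page.extract_text()
    texts.append(text)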

First, PyPDF2 works reasonably well for some PDF files, but for others it does not preserve the spaces between words, for example the PDF from https://www.researchgate.net/publication/342920307_Using_Topic_Modeling_Methods_for_Short-Text_Data_A_Comparative_Analysis.

Second, how can I extract tables when a page contains them? pdfplumber can extract both text and tables with its extract_text() and extract_table() methods, but it also loses the spaces between words in some documents, and in my experience it fails on double-column PDF files.
Tabula is another alternative, but judging from its website (https://github.com/tabulapdf/tabula) it is aimed at tables. My final question: what are the best practices for extracting both kinds of content, text and tables, from PDF files with single-column or double-column pages?

You may try www.algodocs.com, which has a free subscription. With algodocs you can extract both text and tables from system-generated PDFs and from scanned images, even ones of poor quality. See https://www.algodocs.com/blog/extract-tables-from-scanned-pdfs-and-images-with-low-quality – Zhavat Oct 14 '21 at 12:53
Thanks Zhavat, great tool, but it looks like it is not open source and has no Python source code available. – Sam S. Oct 15 '21 at 01:00

score 2 · Answer 1 · answered Mar 29 '22 at 05:05

You could try following this guide to extract text, tables, and also images from the PDF. It uses both PyPDF and tabula-py to do the work, but I'm not sure you can recover the content in its original order, since you are running multiple separate extractions on the same PDF file.
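
As a rough illustration of combining the two libraries, here is a minimal sketch that keeps text and tables together page by page. It assumes PyPDF2 >= 2.0 (PdfReader) and tabula-py (which needs a Java runtime); the function names combine_by_page and extract_text_and_tables are hypothetical and are not taken from the guide.

```python
from typing import Any, Dict, List


def combine_by_page(page_texts: List[str], page_tables: List[List[Any]]) -> List[Dict[str, Any]]:
    """Pair each page's text with the tables found on the same page."""
    if len(page_texts) != len(page_tables):
        raise ValueError("need exactly one list of tables per page")
    return [
        {"page": number, "text": text, "tables": tables}
        for number, (text, tables) in enumerate(zip(page_texts, page_tables), start=1)
    ]


def extract_text_and_tables(pdf_path: str) -> List[Dict[str, Any]]:
    """Run PyPDF2 and tabula-py over the same file, one page at a time."""
    # Imported lazily so combine_by_page works without either library.
    from PyPDF2 import PdfReader  # PyPDF2 >= 2.0 naming (assumption)
    import tabula  # tabula-py

    reader = PdfReader(pdf_path)
    page_texts = [page.extract_text() or "" for page in reader.pages]
    # tabula-py numbers pages from 1; with multiple_tables=True it
    # returns a list of pandas DataFrames for the requested page.
    page_tables = [
        tabula.read_pdf(pdf_path, pages=number, multiple_tables=True)
        for number in range(1, len(page_texts) + 1)
    ]
    return combine_by_page(page_texts, page_tables)
```

This only groups results by page. It does not restore the reading order within a page, and the extracted text will usually still contain the table cells as plain text.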

Mhackiori

Thanks Francesco, a combination of those tools might be a good solution, but it could be a challenge when the PDF file is a bit complex, for example when it has both double-column and single-column pages. Preserving the order of the content is another challenge when combining tools. – Sam S. Mar 29 '22 at 21:55
I think that with other PDF extraction libraries like pdfplumber you might be able to extract text from double-column files, as in this other answer: https://stackoverflow.com/questions/55100037/how-to-extract-text-from-two-column-pdf-with-python – Mhackiori Mar 30 '22 at 06:41
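
One common way to handle double-column pages with pdfplumber is to crop each page into vertical strips and extract the strips left to right. The sketch below assumes the columns split the page into equal widths, which will not hold for every layout; column_boxes and extract_column_text are hypothetical helper names, not part of pdfplumber.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def column_boxes(page_bbox: BBox, columns: int = 2) -> List[BBox]:
    """Split a page bounding box into equal-width strips, left to right."""
    x0, top, x1, bottom = page_bbox
    step = (x1 - x0) / columns
    # Reuse x1 for the last edge so rounding never pushes a strip off the page.
    edges = [x0 + i * step for i in range(columns)] + [x1]
    return [(edges[i], top, edges[i + 1], bottom) for i in range(columns)]


def extract_column_text(pdf_path: str, columns: int = 2) -> List[str]:
    """Return one string per page, reading each column in turn."""
    import pdfplumber  # imported lazily; only needed for real PDFs

    texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            parts = [
                page.crop(box).extract_text() or ""
                for box in column_boxes(page.bbox, columns)
            ]
            texts.append("\n".join(parts))
    return texts
```

A fixed split like this would cut single-column pages down the middle, so a document that mixes both layouts would still need some per-page check before choosing the number of columns.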

kd4ttc · Answer 2 · 2021-09-21T02:32:09.317

1

The answer depends on whether the question is general or specific to a single form. Your approach is reasonable for the general case, but results will vary from file to file. If your PDFs are instances of a single form or report that is generated with different data each time, consider converting the PDF to PostScript and then seeing whether you can parse the PostScript.

Two utilities do this: pdf2ps and pdftops. Try each. This approach works better if you know some PostScript. With some luck, the fields you need will be simple text strings. Worth a try.
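
For reference, either utility can be driven from Python with subprocess. This is a minimal sketch that assumes pdf2ps (Ghostscript) or pdftops (Poppler) is installed and on PATH; postscript_command and convert_to_postscript are hypothetical helper names.

```python
import shutil
import subprocess
from typing import List


def postscript_command(pdf_path: str, ps_path: str, tool: str) -> List[str]:
    """Build the argument list for pdf2ps or pdftops."""
    if tool not in ("pdf2ps", "pdftops"):
        raise ValueError(f"unsupported tool: {tool}")
    # Both utilities accept the form: <tool> input.pdf output.ps
    return [tool, pdf_path, ps_path]


def convert_to_postscript(pdf_path: str, ps_path: str) -> str:
    """Convert with whichever utility is found first; return its name."""
    for tool in ("pdftops", "pdf2ps"):
        if shutil.which(tool):
            # check=True raises CalledProcessError if the conversion fails.
            subprocess.run(postscript_command(pdf_path, ps_path, tool), check=True)
            return tool
    raise FileNotFoundError("neither pdftops nor pdf2ps was found on PATH")
```

The resulting .ps file is plain text, so it can then be opened and searched from Python for the strings of interest.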

edited Sep 21 '21 at 02:32

answered Sep 21 '21 at 02:17

Thanks for the answer. Could you add some sample Python code showing how to convert a PDF to PostScript? – Sam S. Sep 21 '21 at 02:35
From https://stackoverflow.com/questions/45104505/python-convert-pdf-files-to-eps: from subprocess import call; call(["pdf2ps", "input.pdf", "output.eps"]) – Sam S. Sep 21 '21 at 02:36
The utilities are run from the command line. @SamS shows a way to call them from Python. You can also write a shell script to process the files and pipe the output into your Python program. – kd4ttc Oct 01 '21 at 00:20
