I am looking for a way to extract both text and tables from a PDF file. Some packages are good at extracting text, but they are not good enough at extracting tables.
One option is the Azure Form Recognizer Layout model, but it fails on pages that mix text and tables, particularly when a table is laid out like running text: the table contents and the surrounding text get merged together (see the Azure Form Recognizer sample code at https://github.com/Azure-Samples/cognitive-services-quickstart-code/blob/master/python/FormRecognizer/rest/python-train-extract.md).
I also tried PyPDF2 and pdfplumber; here is the code for PyPDF2:
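    import os

    import PyPDF2

    data_path = "directory/to/pdf/files"
    texts = []

    for fp in os.listdir(data_path):
        # Open each PDF in binary mode; the file is closed when the block exits.
        with open(os.path.join(data_path, fp), "rb") as pdf_file:
            print(pdf_file)
            # Legacy PyPDF2 API (PdfFileReader, numPages, getPage, extractText).
            pdfreader = PyPDF2.PdfFileReader(pdf_file)
            count = pdfreader.numPages
            # Concatenate the text of every page in this file.
            text = ""
            for i in range(count):
                page = pdfreader.getPage(i)
                text += page.extractText()
        texts.append(text)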
First, PyPDF2 works reasonably well for some PDF files, but for others it does not preserve the spaces between words, for example the PDF from https://www.researchgate.net/publication/342920307_Using_Topic_Modeling_Methods_for_Short-Text_Data_A_Comparative_Analysis.
Second, how can I extract tables when they exist on a page? pdfplumber can extract both text and tables using its extract_text() and extract_table() methods. However, it also fails to preserve the spaces between words for some documents, and in my experience it fails on double-column PDF files.
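For reference, the pdfplumber usage described above can be sketched roughly as follows. This is only a minimal sketch: the helper names extract_pages and extract_file are made up for illustration and are not part of pdfplumber, and it uses extract_tables() (the plural variant, which returns every table detected on a page) rather than extract_table().

```python
from typing import Any, Dict, List


def extract_pages(pages) -> List[Dict[str, Any]]:
    """Collect text and tables from pdfplumber-style page objects.

    Hypothetical helper: it only relies on each page exposing
    extract_text() and extract_tables().
    """
    results = []
    for number, page in enumerate(pages, start=1):
        results.append({
            "page": number,
            # Fall back to an empty string if no text is returned.
            "text": page.extract_text() or "",
            # One list of rows per detected table; each row is a list of cells.
            "tables": page.extract_tables(),
        })
    return results


def extract_file(path: str) -> List[Dict[str, Any]]:
    """Open a PDF with pdfplumber and extract text and tables page by page."""
    import pdfplumber  # imported lazily so extract_pages has no hard dependency

    with pdfplumber.open(path) as pdf:
        return extract_pages(pdf.pages)
```

Calling extract_file("some.pdf") would return one dictionary per page. This does not address the missing-space or double-column problems by itself; it only shows the two calls in one place.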
Tabula is another alternative, but from what I see in its repository (https://github.com/tabulapdf/tabula) it is only good with tables. My final question is: what are the best practices for extracting both text and tables from PDF files, given pages with either single-column or double-column layouts?