I'm trying to extract text and numerical data from tables in PDF pages, as part of a digitization and data mining project.

There are more than 80,000 PDF documents to process, each with 4 to 80 pages that combine images, text, comments and several types of tables. I've successfully used PDFMiner to process the pages and collect most of the relevant data, but after many weeks of trying different techniques I have failed to get error-free data from the tables.

Unfortunately, the tables come in several layouts: some columns are justified (justification adds white space between words), some cells span many lines, the line spacing varies throughout a table, and there are multi-column cells as well. The character margin, line margin and word margin parameters that perform best on ordinary pages deliver messy results when applied to tables.

Fortunately, about 3/4 of the tables have vertical and horizontal lines that can be used to split the table area into cells and to find the coordinates of each cell. However, the LTText instances produced by pdfminer.pdfinterp.PDFPageInterpreter and pdfminer.converter.PDFPageAggregator often do not respect the cell boundaries. I've spent many days trying different approaches, including changes to laparams and to string interpretation and splitting, to make use of the LTText instances generated by whole-page processing. What would be really useful is something that replaces interpreter.process_page(page) with interpreter.process_cell(page, xmin, ymin, xmax, ymax).

I believe a solution may exist if PDFMiner's own functions and methods can be used to fetch only the objects enclosed within a cell's boundaries, with conservative laparams to avoid messy results. The ideal method should be fast, because it has to be applied many times. Searching Stack Overflow, I found "Extracting text from PDF page's certain areas?", which is similar but was never answered. I also found "Extract area from pdf" and "Extract PDF text by coordinates", which rely on other libraries and techniques. I would rather not mix different libraries and their objects to tackle the problem, because PDFMiner has been very efficient at recovering every piece of information other than the tables. Does anyone have suggestions?
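To make the idea concrete, here is a minimal sketch of what such a per-cell extraction might look like. It is not part of PDFMiner: the names iter_chars and text_in_cell, and the line_tol and gap_ratio parameters, are hypothetical. It assumes the page layout is a tree of objects in which individual glyphs expose a bbox tuple (x0, y0, x1, y1) and a get_text() method, as PDFMiner's LTChar does, and that the cell coordinates are in the same PDF coordinate space (y grows upward).

```python
def iter_chars(obj):
    """Yield leaf glyph objects found anywhere under obj.

    A leaf is anything that is not iterable but has .bbox and .get_text()
    (assumption: this matches LTChar; objects without a bbox are skipped).
    """
    try:
        children = list(obj)
    except TypeError:
        children = None
    if children is None:
        if hasattr(obj, "bbox") and hasattr(obj, "get_text"):
            yield obj
        return
    for child in children:
        yield from iter_chars(child)


def text_in_cell(layout, xmin, ymin, xmax, ymax, line_tol=2.0, gap_ratio=0.3):
    """Return the text of glyphs whose centre lies inside the cell rectangle.

    line_tol  : max vertical distance between glyph centres on the same line
    gap_ratio : horizontal gap (as a fraction of glyph width) read as a space
    Both thresholds are illustrative guesses and would need tuning.
    """
    picked = []
    for ch in iter_chars(layout):
        x0, y0, x1, y1 = ch.bbox
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        if xmin <= cx <= xmax and ymin <= cy <= ymax:
            picked.append((cy, x0, x1, ch.get_text()))

    # Top-to-bottom reading order: larger y first, then left to right.
    picked.sort(key=lambda t: (-t[0], t[1]))

    # Group glyphs into lines by vertical position.
    lines = []
    for cy, x0, x1, text in picked:
        if lines and abs(lines[-1]["cy"] - cy) <= line_tol:
            lines[-1]["chars"].append((x0, x1, text))
        else:
            lines.append({"cy": cy, "chars": [(x0, x1, text)]})

    out = []
    for line in lines:
        pieces = []
        prev_x1 = None
        for x0, x1, text in sorted(line["chars"]):
            # A wide gap between glyphs is treated as a word break.
            if prev_x1 is not None and x0 - prev_x1 > gap_ratio * (x1 - x0):
                pieces.append(" ")
            pieces.append(text)
            prev_x1 = x1
        # Collapse runs of white space, e.g. from justified columns.
        joined = " ".join("".join(pieces).split())
        if joined:
            out.append(joined)
    return "\n".join(out)
```

The layout argument would be the object returned by device.get_result() after interpreter.process_page(page). As far as I can tell, creating the PDFPageAggregator with laparams=None skips layout analysis, so the page holds raw character objects and no text boxes are formed across cell borders; that behaviour should be verified on real pages. Because each page is interpreted only once and every cell is just a filter over the same glyph list, the cost per cell stays low.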

A.Barata

0 Answers