5

I'm trying to get the data from the tables in this PDF. I've tried pdfminer and pypdf with limited success, but I can't get the table data out reliably.

This is what one of the tables looks like: [image of the table]

As you can see, some columns are marked with an 'x'. I'm trying to convert this table into a list of objects.

This is the code so far; I'm using pdfminer now.

# pdfminer test
try:
    from cStringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def pdfToText(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ''
    maxpages = 0
    caching = True
    pagenos = set()

    records = []
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        # process page
        interpreter.process_page(page)

        # take this page's text, then reset the buffer so the next page
        # does not re-read the text of earlier pages
        lines = retstr.getvalue().splitlines()
        retstr.seek(0)
        retstr.truncate(0)

        # only select lines from the line containing 'Tool' to the line containing "1 The 'All'"
        idx = containsSubString(lines, 'Tool')
        if idx != -1:
            lines = lines[idx+1:]
        idx = containsSubString(lines, "1 The 'All'")
        if idx != -1:
            lines = lines[:idx]

        records.extend(lines)

    fp.close()
    device.close()
    retstr.close()

    return records


def containsSubString(lines, substring):
    # return the index of the first line containing substring, or -1 if none does
    for i, s in enumerate(lines):
        if substring in s:
            return i
    return -1


# process pdf
fn = '../test1.pdf'
ft = 'test.txt'

text = pdfToText(fn)
with open(ft, 'w') as outFile:
    for line in text:
        # splitlines() dropped the line endings, so add them back
        outFile.write(line + '\n')

That produces a text file with all of the text, but the spacing of the x's isn't preserved. The output looks like this: [image of the text output]

The x's are just single-spaced in the text document.
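
One way to keep each 'x' in its column is to work from coordinates rather than from flattened text. The sketch below assumes every text fragment has already been extracted together with its horizontal position (pdfminer's layout objects carry a `bbox` of `(x0, y0, x1, y1)`, for instance when pages are read through `PDFPageAggregator`). The column names, left edges, and the sample row are made up for illustration and are not taken from the PDF in question.

```python
from bisect import bisect_right

# Hypothetical column layout: name and left edge (in PDF points) of each column.
COLUMNS = [("Tool", 0.0), ("All", 200.0), ("Basic", 260.0), ("Pro", 320.0)]
LEFT_EDGES = [left for _, left in COLUMNS]


def column_for(x_center):
    """Return the name of the column whose horizontal span contains x_center."""
    idx = bisect_right(LEFT_EDGES, x_center) - 1
    return COLUMNS[max(idx, 0)][0]


def row_to_record(fragments):
    """Turn one table row into a dict keyed by column name.

    fragments is a list of (text, x0, x1) tuples for a single row.
    Columns without a fragment stay empty, so an 'x' keeps its position.
    """
    record = {name: "" for name, _ in COLUMNS}
    for text, x0, x1 in fragments:
        name = column_for((x0 + x1) / 2.0)
        record[name] = (record[name] + " " + text).strip()
    return record


# Made-up row: a tool name plus marks in the second and fourth columns.
sample_row = [("Hammer", 10.0, 52.0), ("x", 225.0, 231.0), ("x", 345.0, 351.0)]
record = row_to_record(sample_row)
print(record)
```

Since the headers are expected to be the same in every PDF, the column edges would only need to be measured once, assuming the page layout is also the same.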

Right now, I'm just producing text output, but my goal is to produce an HTML document with the data from the tables. I've been searching for OCR examples, and most of them seem confusing or incomplete. I'm open to using C# or any other language that might produce the results I'm looking for.
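
For the HTML step, a minimal sketch using only the standard library (Python 3) might look like the following. It assumes the table rows are already available as dicts keyed by header name; the function name `records_to_html`, the headers, and the sample records are hypothetical.

```python
from html import escape


def records_to_html(headers, records):
    """Render a list of dicts as a plain HTML table, one row per record."""
    head = "".join("<th>{}</th>".format(escape(h)) for h in headers)
    rows = []
    for rec in records:
        # Missing keys become empty cells so every row has the same width.
        cells = "".join(
            "<td>{}</td>".format(escape(rec.get(h, ""))) for h in headers
        )
        rows.append("<tr>{}</tr>".format(cells))
    return "<table>\n<tr>{}</tr>\n{}\n</table>".format(head, "\n".join(rows))


# Made-up data for illustration only.
headers = ["Tool", "All"]
records = [{"Tool": "Saw & file", "All": "x"}, {"Tool": "Hammer"}]
page = records_to_html(headers, records)
print(page)
```

Escaping each cell keeps characters such as `&` or `<` in the extracted text from breaking the markup.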

EDIT: There will be multiple PDFs like this that I need to get the table data from. The headers will be the same for all PDFs (as far as I know).

Trenton McKinney
user

2 Answers

3

I figured it out; I was going in the wrong direction. What I did was create PNGs of each table in the PDF, and now I'm processing the images using OpenCV and Python.
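
The answer does not say how the table images were processed, so the following is only a guess at one possible step, not the answerer's method. Assuming a table image has already been cropped and binarised (1 = dark pixel) and the grid line positions are known, a cell can be treated as marked when enough of its pixels are dark. The image is a plain list of lists so the sketch runs without OpenCV; the grid positions and the 0.1 threshold are arbitrary assumptions.

```python
def cell_is_marked(image, top, bottom, left, right, threshold=0.1):
    """Return True if the share of dark pixels in the cell exceeds threshold.

    image is a list of rows of 0/1 values; bounds are half-open pixel ranges.
    """
    total = (bottom - top) * (right - left)
    dark = sum(sum(row[left:right]) for row in image[top:bottom])
    return total > 0 and dark / float(total) > threshold


def read_grid(image, row_edges, col_edges):
    """Classify every cell bounded by consecutive row and column edges."""
    return [
        [cell_is_marked(image, t, b, l, r)
         for l, r in zip(col_edges, col_edges[1:])]
        for t, b in zip(row_edges, row_edges[1:])
    ]


# Made-up 4x8 binary image: two 4x4 cells, with an 'x' drawn in the second.
image = [
    [0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 1],
]
marks = read_grid(image, row_edges=[0, 4], col_edges=[0, 4, 8])
print(marks)
```

On a real scan the threshold would need tuning, since grid lines and noise inside a cell also count as dark pixels.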

user
  • 2
    Could you please describe the approach in a more detailed way? How did you extract the tables? Which type of image segmentation did you use? – sdk Nov 14 '16 at 13:15
  • It's an old post, but could you please share how you got the images of a table from a PDF file using OpenCV? – Rikkas Oct 02 '18 at 14:51
  • There is also Camelot, a Python tool made to get tables from PDFs. https://github.com/socialcopsdev/camelot – james-see Mar 24 '19 at 21:14
  • 1
    @Saradhi Thank you, I will check that out – user Oct 08 '19 at 19:52
2

Give Tabula a try and, if it works, use the tabula-extractor library (written in Ruby) to extract the data programmatically.

matagus
  • Tabula almost worked. It sees most of the table, but some of the x's end up together in the same cell. – user Jan 13 '15 at 18:20
  • It only works on text-based PDFs, not on images. Is there anything similar that can extract data from image-based PDFs? – Sundeep Pidugu Nov 30 '18 at 06:07