
I am looking to scrape information from this PDF into the following format:

[screenshot of the desired output format]

I have circled the areas in the PDF where the information will come from.

As you can see, the formatting of this PDF is highly unstructured and, to make matters worse, different PDFs can come in completely different layouts and there will also be missing information. It is already hard for a human unfamiliar with mining to parse this PDF, as not all the information is clearly labelled.

So my question: Is it even possible to come up with an automated approach to process thousands of PDFs like this? If so, how would I begin to approach this task? I can program pretty well in R and Python.

I realise this is a pretty difficult (if not impossible) task. Thanks for your input.

Artjom B.
mchangun
  • Extracting text from PDF with Python. http://stackoverflow.com/questions/1848464/advanced-pdf-parsing-using-python-extracting-text-without-tables-etc-whats – AbsoluteƵERØ Jun 14 '13 at 06:23
  • *highly unstructured* + *completely different layouts* = **Without a real AI there is no automated approach.** The best you can hope for is some half-automated approach, i.e. some application showing the PDF and asking the user to mark the data items to extract one after the other. – mkl Jun 14 '13 at 07:53
  • @mkl Thanks for your reply. "Without a real AI..." – what kind of AI are we talking about here? Even with AI and machine learning, is this task feasible? If so, what are the techniques? I know bits of SVM, NN etc. but don't really see how they can help here. – mchangun Jun 14 '13 at 08:28
  • For *somewhat unstructured* + *in details different layouts* I assume some current machine learning approach would do. For *highly unstructured* + *completely different layouts,* though, I'm afraid you'll need **HAL 9000.** – mkl Jun 14 '13 at 08:42
  • @mkl =( That's what I feared. – mchangun Jun 14 '13 at 08:52

2 Answers


I don't think this is as difficult as people make out. I agree it's not going to be 100% accurate, but you can simply factor that potential inaccuracy into your process. Humans aren't 100% accurate either.

So I would suggest using a PDF library to extract the text and then running a set of keyword matches to find the appropriate information. For each value you extract, mark the original PDF, perhaps with a red circle like the ones in your example PDF.
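A minimal sketch of the keyword-matching step, assuming the text has already been extracted from the PDF; the field names and label patterns here are invented examples, not taken from the actual document:

```python
import re

# Hypothetical label patterns for fields you expect to find in the
# extracted text; adapt these to the labels your PDFs actually use.
FIELD_PATTERNS = {
    "project_name": re.compile(r"Project\s*Name[:\s]+(.+)", re.IGNORECASE),
    "tonnage": re.compile(r"Tonnage[:\s]+([\d,\.]+)", re.IGNORECASE),
    "grade": re.compile(r"Grade[:\s]+([\d\.]+\s*g/t)", re.IGNORECASE),
}

def extract_fields(text):
    """Return a dict of field -> matched value, or None if not found."""
    results = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        results[field] = match.group(1).strip() if match else None
    return results

sample = "Project Name: Red Hill Mine\nTonnage: 1,234,567\nGrade: 2.5 g/t"
print(extract_fields(sample))
```

Missing fields simply come back as `None`, which fits the observation that some PDFs will be missing information.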

Then, in the final output, store not only the extracted data but also the annotated PDF, so that people can check the data and override the values where appropriate. Periodically, review the overridden values and adapt your heuristics to cope better.

You would also need a test bed where you can store your thousands of test documents and validate any code change against your existing knowledge base. That gives you the confidence to change things while being reasonably certain you haven't broken anything crucial.
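The test bed could be as simple as replaying stored documents through the extractor and diffing the output against hand-verified answers. A sketch, with a toy extractor standing in for the real one (both the function names and the cases are illustrative assumptions):

```python
def run_regression(extract_fields, cases):
    """cases: list of (text, expected_dict) pairs.
    Returns a list of (case_index, field, expected, got) mismatches."""
    failures = []
    for i, (text, expected) in enumerate(cases):
        got = extract_fields(text)
        for field, want in expected.items():
            if got.get(field) != want:
                failures.append((i, field, want, got.get(field)))
    return failures

def toy_extractor(text):
    # stand-in for your real keyword extractor
    return {"tonnage": "100" if "Tonnage: 100" in text else None}

cases = [
    ("Tonnage: 100", {"tonnage": "100"}),
    ("no data here", {"tonnage": None}),
]
print(run_regression(toy_extractor, cases))  # an empty list means no regressions
```

Any change to the heuristics can then be gated on this check coming back empty.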

My answers may feature concepts based around ABCpdf. It's what I work on. It's what I know. :-)


I couldn't see your PDF; the link may be broken. But for extracting data from unstructured PDFs, consider using pdftotext to convert the PDF into plain text:

pdftotext -layout {PDF-file} {text-file}
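The same conversion can be scripted from Python; a sketch, assuming pdftotext (from poppler-utils) is installed and on the PATH:

```python
import shutil
import subprocess

def build_pdftotext_cmd(pdf_path, txt_path):
    # -layout preserves the visual column layout in the text output,
    # which helps keep values near their labels
    return ["pdftotext", "-layout", pdf_path, txt_path]

def pdf_to_text(pdf_path, txt_path):
    """Convert a PDF to plain text and return the text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) is not installed")
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as f:
        return f.read()

print(build_pdftotext_cmd("report.pdf", "report.txt"))
```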

Then use a small Python package I created when I was facing a similar problem. I'm an amateur programmer, so the library may be a little 'dirty' and may contain some bugs. You can install it via pip:

sudo pip install MassTextExtractor

You can see an example of its use in this answer.

Rui Lima