I am looking to scrape information from this PDF into the following format:
I have circled the areas in the PDF where the information will come from.
As you can see, the formatting of this PDF is highly unstructured. To make matters worse, different PDFs can come in completely different layouts, and some information will be missing. Even a human unfamiliar with mining would find this PDF hard to parse, because not all of the information is clearly labelled.
So my question: Is it even possible to come up with an automated approach to process thousands of PDFs like this? If so, how would I begin to approach this task? I can program pretty well in R and Python.
I realise this is a pretty difficult (if not impossible) task. Thanks for your input.