Scraping PDF data into Excel absolute beginner

Question

This is literally day 1 of python for me. I've coded in VBA, Java, and Swift in the past, but I am having a particularly hard time following guides online for coding a pdf scraper. Since I have no idea what I am doing, I keep running into a wall every time I want to test out some of the code I've found online.

Basic Info

Windows 7 64bit
python 3.6.0
Spyder3
I have installed many of the PDF-related packages (PyPDF2, pdfminer, pdfquery, pdfrw, etc.)

Goals

To create something in Python that converts the PDFs in a folder into an Excel file (ideally) or a text file (which I would then convert using VBA).

Issues

Every time I try sample code from guides I've found online, I run into syntax errors on the lines where I call the PDF that I want to test the code on. Some guide links and error examples are below. Should I be putting my test.pdf into the same folder as the .py file?

How to scrape tables in thousands of PDF files?
- I got an invalid syntax error due to "for" on the last line
PDFMiner guide (Link)

runfile('C:/Users/U587208/Desktop/pdffolder/pdfminer.py', wdir='C:/Users/U587208/Desktop/pdffolder')
  File "C:/Users/U587208/Desktop/pdffolder/pdfminer.py", line 79
    print pdf_to_csv('test.pdf', separator, threshold)
                   ^
SyntaxError: invalid syntax

(1) you need another set of parentheses, ie `print(pdf_to_csv('test.pdf', separator, threshold))` because in Python 3 `print` is a function; (2) this will be dependent on the exact structure of your pdf file; pdf is a page layout format, not a data description format, so you could have a bit of a rough time. — Hugh Bothwell, Jun 12 '17 at 16:12
Hugh, Is there another method you would recommend? Or maybe a good resource for figuring things out? — kidusk, Jun 12 '17 at 16:43

score 1 · Answer 1 · answered Jun 12 '17 at 18:40

It seems that the tutorials you are following use Python 2. There are usually only a few noticeable differences; the biggest is that in Python 3, print became a function, so it must be called with parentheses:

print()

I would recommend either changing your version of Python or finding a tutorial for Python 3. Hope this helps.
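
For illustration, here is a minimal sketch of the failing line from the question rewritten for Python 3. The body of `pdf_to_csv` below is a hypothetical stand-in, not the tutorial's real function, and the `separator` and `threshold` values are made up; only the call syntax on the last line matters.

```python
def pdf_to_csv(filename, separator, threshold):
    # Hypothetical stand-in for the tutorial's function, so the call runs
    return separator.join([filename, str(threshold)])

separator = ','   # assumed value, for illustration only
threshold = 1.5   # assumed value, for illustration only

# Python 2 statement form (a SyntaxError in Python 3):
#     print pdf_to_csv('test.pdf', separator, threshold)
# Python 3 form: print is a function, so its argument goes in parentheses
result = pdf_to_csv('test.pdf', separator, threshold)
print(result)
```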

score 0 · Answer 2 · answered Jun 12 '17 at 18:54

Here is an example (Pdfminer python 3.5) of how to extract information from a PDF. It does not solve the problem of exporting tables to Excel, though. Commercial products are probably better at that.
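
As a rough sketch of the folder-to-file goal in the question, the following walks a folder, extracts text from each PDF, and writes the non-empty lines to a CSV file that Excel can open. It assumes the `pdfminer.six` fork is installed and that its `pdfminer.high_level.extract_text` helper is available in your version; the function names `extract_pdf_text` and `pdf_folder_to_csv` and the column names are made up for this example. It does not reconstruct tables, it only dumps text line by line.

```python
import csv
from pathlib import Path

def extract_pdf_text(pdf_path):
    # Assumes pdfminer.six; imported here so the rest of the sketch
    # still loads if the package is missing
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))

def pdf_folder_to_csv(folder, out_csv, extractor=extract_pdf_text):
    """Write one CSV row per non-empty text line of every PDF in `folder`.

    Returns the number of PDF files processed.
    """
    count = 0
    with open(out_csv, 'w', newline='', encoding='utf-8') as fh:
        writer = csv.writer(fh)
        writer.writerow(['file', 'line_number', 'text'])
        for pdf_path in sorted(Path(folder).glob('*.pdf')):
            text = extractor(pdf_path)
            for number, line in enumerate(text.splitlines(), start=1):
                if line.strip():
                    writer.writerow([pdf_path.name, number, line.strip()])
            count += 1
    return count
```

A call such as `pdf_folder_to_csv('C:/Users/U587208/Desktop/pdffolder', 'output.csv')` would then process every PDF in that folder. Passing the folder explicitly means the result does not depend on which working directory the script is run from.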

pyano

score 0 · Answer 3 · answered Jun 13 '17 at 13:42

I am trying to do this exact same thing! I have been able to convert my PDF to text, but the formatting is extremely random and messy, and I need the tables to stay intact so I can write them into Excel sheets. I am now attempting to convert to XML to see if it will be easier to extract from. If I get anywhere on this I will let you know :)
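
One possible way to keep tables intact, sketched here under assumptions: if you can get each text fragment together with its page coordinates (for example from XML output), you can group fragments whose vertical positions are close into one row and order each row left to right. The function name `fragments_to_rows`, the `(x, y, text)` tuple layout, and the sample data are all hypothetical; real PDFs will need tuning of `threshold`.

```python
def fragments_to_rows(fragments, threshold=2.0, separator=','):
    """Group (x, y, text) fragments into rows of separator-joined text.

    A fragment joins the current row when its y coordinate is within
    `threshold` of the first fragment placed in that row.
    """
    rows = []  # each entry: (row_y, [(x, text), ...])
    # PDF y coordinates grow upward, so sort descending to read top to bottom
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= threshold:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within a row, order cells left to right by x
    return [separator.join(t for _, t in sorted(cells)) for _, cells in rows]

# Made-up sample: two table rows whose cells have slightly different y values
fragments = [
    (50, 700.0, 'Name'), (200, 700.5, 'Qty'),
    (200, 680.2, '3'), (50, 680.0, 'Widget'),
]
rows = fragments_to_rows(fragments)
print(rows)  # ['Name,Qty', 'Widget,3']
```

If cell text can itself contain the separator, writing the cells with the `csv` module instead of joining strings would be safer.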

btw, use Python 2 if you're going to use the original pdfminer package; the pdfminer.six fork supports Python 3. Here's some help with pdfminer: https://media.readthedocs.org/pdf/pdfminer-docs/latest/pdfminer-docs.pdf
