
This is literally day 1 of Python for me. I've coded in VBA, Java, and Swift in the past, but I am having a particularly hard time following online guides for coding a PDF scraper. Since I have no idea what I am doing, I keep running into a wall every time I want to test some of the code I've found online.

Basic Info

  • Windows 7 64bit
  • python 3.6.0
  • Spyder3
  • I have many of the PDF-related packages (PyPDF2, pdfminer, pdfquery, pdfrw, etc.)

Goals

To create something in Python that converts the PDFs in a folder into an Excel file (ideally) OR a text file (which I will then convert using VBA).

Issues

Every time I try sample code from guides I've found online, I run into syntax errors on the lines where I call the PDF that I want to test the code on. An error example is below. Should I be putting my test.pdf into the same folder as the .py file?

runfile('C:/Users/U587208/Desktop/pdffolder/pdfminer.py', wdir='C:/Users/U587208/Desktop/pdffolder')
  File "C:/Users/U587208/Desktop/pdffolder/pdfminer.py", line 79
    print pdf_to_csv('test.pdf', separator, threshold)
                   ^
SyntaxError: invalid syntax
kidusk
  • (1) you need another set of parentheses, ie `print(pdf_to_csv('test.pdf', separator, threshold))` because in Python 3 `print` is a function; (2) this will be dependent on the exact structure of your pdf file; pdf is a page layout format, not a data description format, so you could have a bit of a rough time. – Hugh Bothwell Jun 12 '17 at 16:12
  • Hugh, Is there another method you would recommend? Or maybe a good resource for figuring things out? – kidusk Jun 12 '17 at 16:43

3 Answers


It seems that the tutorials you are following use Python 2. There are usually few noticeable differences; the biggest is that in Python 3, print became a function, so it must be called with parentheses:

print()
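Applied to the failing line from the question, the change might look like the sketch below. Note that `pdf_to_csv` here is only a hypothetical stand-in (the tutorial's real function is not shown in the question), and the values of `separator` and `threshold` are assumptions chosen for illustration.

```python
def pdf_to_csv(filename, separator, threshold):
    # Hypothetical stand-in for the tutorial's function, which is not
    # shown in the question. It only echoes its arguments.
    return separator.join([filename, str(threshold)])


separator = ","   # assumed value, for illustration only
threshold = 1.5   # assumed value, for illustration only

# Python 2 statement form (a SyntaxError on Python 3):
#     print pdf_to_csv('test.pdf', separator, threshold)

# Python 3 function form: the whole call goes inside print(...)
result = pdf_to_csv('test.pdf', separator, threshold)
print(result)
```

Every other `print something` line in a Python 2 tutorial needs the same treatment before the script will run on Python 3.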

I would recommend either changing your version of Python or finding a tutorial for Python 3. Hope this helps.

cbolles

Here is an example, "Pdfminer python 3.5", of how to extract information from a PDF. However, it does not solve the problem of exporting tables to Excel. Commercial products are probably better at that...
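For the folder-level part of the goal, a minimal sketch could look like the following. It is not pdfminer's API: `pdfs_to_csv` is a hypothetical helper, and `extract_text` is a function you supply yourself (for example, one that wraps whichever PDF library you end up using). The output is a CSV file, which Excel can open, with one row per PDF; it does not attempt to preserve tables.

```python
import csv
from pathlib import Path


def pdfs_to_csv(folder, out_csv, extract_text):
    """Write one CSV row (file name, extracted text) per PDF in `folder`.

    `extract_text` is a caller-supplied function (an assumption of this
    sketch) that takes a path to a PDF and returns its text as a string.
    Returns the number of PDFs processed.
    """
    # Collect the PDFs first, in a stable order, ignoring other files.
    pdf_paths = sorted(
        p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf"
    )
    # newline="" is what the csv module expects when writing files.
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "text"])
        for pdf_path in pdf_paths:
            writer.writerow([pdf_path.name, extract_text(pdf_path)])
    return len(pdf_paths)
```

A call might look like `pdfs_to_csv('C:/Users/U587208/Desktop/pdffolder', 'out.csv', my_extractor)`, where `my_extractor` is whatever extraction function you have working. Keeping the extractor separate means the folder loop does not have to change if you switch PDF libraries.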

pyano

I am trying to do this exact same thing! I have been able to convert my PDF to text; however, the formatting is extremely random and messy, and I need the tables to stay intact to be able to write them into Excel data sheets. I am now attempting to convert to XML to see if it will be easier to extract from. If I get anywhere on this I will let you know :)

btw, use Python 2 if you're going to use pdfminer. Here's some help with pdfminer: https://media.readthedocs.org/pdf/pdfminer-docs/latest/pdfminer-docs.pdf

Jess