How to convert a PDF document to an Excel spreadsheet:
Option 1, using the PDFTables API:
- Install the pdftables_api package with
pip install git+https://github.com/pdftables/python-pdftables-api.git
- Create a PDFTables account to get an API key
Once everything is installed, you can run this code:
import pdftables_api
c = pdftables_api.Client('my-api-key')
c.xlsx('input.pdf', 'output')
# Use c.csv, c.xml or c.html instead of c.xlsx to convert to CSV, XML or HTML
# This example comes from the PDFTables documentation
Don't forget to replace my-api-key with your API key, input.pdf with the path to your PDF, and output with the path of the Excel file you want to create. As far as I know, the library adds the .xlsx extension for you if it is missing.
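If you would rather not hard-code the key and the output name, here is a minimal sketch of one way to wrap the call above. The helper names (output_path_for, convert_pdf) and the PDFTABLES_API_KEY environment variable are my own assumptions, not part of the PDFTables library; only Client and xlsx come from the example above.

```python
import os
from pathlib import Path


def output_path_for(pdf_path):
    """Return the input path with its extension swapped for .xlsx."""
    return Path(pdf_path).with_suffix(".xlsx")


def convert_pdf(pdf_path):
    """Convert one PDF with PDFTables and return the path of the new file."""
    # Imported here so the path helper above works without the package installed.
    import pdftables_api

    # PDFTABLES_API_KEY is a hypothetical variable name; set it yourself.
    client = pdftables_api.Client(os.environ["PDFTABLES_API_KEY"])
    out_path = output_path_for(pdf_path)
    client.xlsx(str(pdf_path), str(out_path))
    return out_path
```

With the key exported in your shell, convert_pdf('input.pdf') should write input.xlsx next to the original file.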
Option 2, using textract to read the PDF and then writing the text to a spreadsheet with xlwt:
- Install textract with
pip install textract
- Install xlwt with
pip install xlwt
Once you have installed the dependencies, you can run the following code:
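import textract
from xlwt import Workbook
wb = Workbook()
sheet1 = wb.add_sheet('Sheet 1')
# process() returns bytes, so decode them to get a string
text = textract.process("path/to/file.extension").decode("utf-8")  # change this to the path of your file
sheet1.write(1, 0, 'Data')  # row 1, column 0
wb.save('output.xls')
I don't know how your PDF is organized, so you'll have to work out how to split the text into cells yourself. sheet1.write(1, 0, 'Data') writes 'Data' to the cell at row 1, column 0 (both are zero-based), so replace it with your own values. Note that xlwt writes the older .xls format, not .xlsx.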
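As a starting point for that last step, here is a rough sketch that puts each non-empty line of the extracted text on its own row and each whitespace-separated word in its own cell. That layout is purely an assumption about your PDF, and the function names (text_to_rows, write_rows, save_text_as_xls) are made up for this example; only Workbook, add_sheet, write and save are xlwt calls.

```python
def text_to_rows(raw, encoding="utf-8"):
    """Split extracted text into rows of cells: one row per non-empty line,
    one cell per whitespace-separated word."""
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    return [line.split() for line in text.splitlines() if line.strip()]


def write_rows(sheet, rows):
    """Write rows to any object with a write(row, col, value) method,
    such as an xlwt worksheet. Returns the number of cells written."""
    count = 0
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            sheet.write(r, c, value)
            count += 1
    return count


def save_text_as_xls(raw, out_path):
    """Write the extracted text to a new .xls file using xlwt."""
    from xlwt import Workbook  # imported lazily; requires `pip install xlwt`

    wb = Workbook()
    sheet = wb.add_sheet("Sheet 1")
    write_rows(sheet, text_to_rows(raw))
    wb.save(out_path)
```

You would call it as save_text_as_xls(textract.process("path/to/file.extension"), "output.xls"). If your columns are separated by something other than whitespace, change the split in text_to_rows accordingly.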
I personally think you should use the PDFTables API instead of doing the conversion manually.