PDF data extraction with Python 3.4

Question

BACKGROUND: I am using Python 3.4, PyPDF2, and regular expressions to extract data from the table on page 1 of the following PDF:

http://minerals.usgs.gov/minerals/pubs/commodity/gold/mcs-2015-gold.pdf.

import PyPDF2
import re

gold_pdf = r'C:\Users\xxxxx_x\xxxxxxx\mcs_gold_2015.pdf'
pdfFileObj = open(gold_pdf,'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

pageObj = pdfReader.getPage(0)
page_text = pageObj.extractText()  # extract the page text once and reuse it

start_pos = page_text.index('United States\n:')
end_pos = page_text.index('Recycling\n:')

table_text = page_text[start_pos:end_pos]

print(table_text)
print(re.findall(r'\d+[\d,]*\d', table_text))

*Results*
['2010', '2011', '2012', '2013', '2014', '231', '234', '235', '230', '211', '175', '220', '222', '223', '200', '198', '263', '215', '210', '200', '616', '550', '326', '315', '315', '383', '644', '695', '691', '430', '180', '168', '147', '160', '165', '8,140', '8,140', '8,140', '8,140', '8,140', '1,228',  '1,572', '1,673', '1,415', '1,270', '10,300', '11,100', '12,700', '12,958', '12,500']

PROBLEM: The USGS Mineral Commodity Summaries include many more PDFs with a similar structure. I am trying to scrape them with PyPDF2 in the same way, but it doesn't work. I already checked with the USGS, and the data aren't available in any other format.

For example, if I use the Silver PDF (http://minerals.usgs.gov/minerals/pubs/commodity/silver/mcs-2015-silve.pdf) instead of the Gold PDF in my example above, I don't get the desired results.

*Output from pageObj.extractText():*
'SILVER\n \n\nDomestic Production and     Use\n:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSalient Statistics\nŠUnited States\n:2010 2011 20122013 2014e\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n \n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nRecycling\n:\n\nImport Sources (2010\nŒ13)\n:2\nTariff\n:\nDepletion Allowance\n:\n Government Stockpile\n:\nEvents, Trends, and Issues\n: \n\n\n\n\n\n\n\n\n\n\n\n\n\n\nFlorence C. Katrivanos\n [(703) 648\nŒ6782, fkatrivanos@usgs.gov]\n '

QUESTION: Why doesn't the data extract from the Silver PDF the same way it does from the Gold PDF?

Which Python library should I use with Python 3.4? I can't find a good solution for PDF scraping in Python 3.4 (see the following post: Best tool for text extraction from PDF in Python 3.4).

Thanks so much for your assistance!

score 1 · Answer 1 · answered Dec 30 '15 at 00:13

I recommend using pdftotext from the Poppler suite for this. It is commonly available on Linux and other UNIX-like systems, and builds for MS Windows exist as well.

import subprocess as sp

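# Raw string: otherwise '\U' in the Windows path is parsed as a Unicode escape.
filename = r'C:\Users\xxxxx_x\xxxxxxx\mcs_gold_2015.pdf'
# '-layout' keeps the physical page layout, so table columns stay aligned.
# The trailing '-' sends the text to stdout; without it, pdftotext writes a .txt file next to the PDF.
btext = sp.check_output(['pdftotext', '-layout', filename, '-'])
text = btext.decode('utf-8')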

Both the files you linked to converted fine using this method.
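As a possible next step, the numbers can be pulled out of the converted text with the same regular expression used in the question. The following is only a minimal sketch: the helper name `extract_table_numbers`, the marker strings, and the sample text (including its row labels) are assumptions for illustration, and the real pdftotext output will need its own markers.

```python
import re

def extract_table_numbers(text, start_marker='United States', end_marker='Recycling'):
    """Return the numeric tokens found between two marker strings.

    The marker strings are assumptions; check them against the real output.
    """
    start = text.index(start_marker)
    end = text.index(end_marker, start)
    # Same pattern as in the question: at least two digits, commas allowed
    # inside. Single-digit values would not be matched by this pattern.
    return re.findall(r'\d+[\d,]*\d', text[start:end])

# Hypothetical stand-in for pdftotext output; the row labels are made up.
sample = (
    "Salient Statistics - United States:   2010   2011   2012\n"
    "Row A                                   231    234    235\n"
    "Row B                                 8,140  8,140  8,140\n"
    "Recycling: ...\n"
)

numbers = extract_table_numbers(sample)
# Strip the thousands separators to get integers.
values = [int(n.replace(',', '')) for n in numbers]
print(values)
```

With the real files, `text` from the conversion above would be passed in place of `sample`.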

Roland Smith


Thank you for your reply, Roland. So is it as simple as downloading the latest binary for Windows (poppler-0.37_x86.7z) and then running the code above? Thanks so much for your assistance. I am new to Python, so I am trying to figure things out. – mickey224 Dec 31 '15 at 18:49
@mickey224 The download is an archive that you need to unpack somewhere, preferably in a directory that is in the `$PATH`. You may also need to change `pdftotext` to `pdftotext.exe`. – Roland Smith Jan 01 '16 at 20:48
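
One way to check whether the unpacked program can actually be found is to ask Python to look it up on the search path before calling it. This is only a sketch: the helper name `find_pdftotext` is hypothetical, and it assumes the directory containing the unpacked executable has been added to the search path.

```python
import shutil

def find_pdftotext(candidates=('pdftotext', 'pdftotext.exe')):
    """Return the full path of the first candidate found on the search path, else None."""
    for name in candidates:
        path = shutil.which(name)  # available since Python 3.3
        if path is not None:
            return path
    return None

exe = find_pdftotext()
if exe is None:
    print('pdftotext not found; add the directory that contains it to the search path')
else:
    print('Using', exe)
```

If a path is returned, it can be used in place of the bare `'pdftotext'` string in the `check_output` call.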
