How can I use Python and its packages to extract specific data from thousands of PDFs?

Asked Feb 01 '16 at 15:08

Active Feb 01 '16 at 15:08

Viewed 50 times

I have downloaded more than 2000 PDFs, which are annual reports of companies. I want to extract specific data from the forms in the PDFs, such as the loans. I have tried PDFMiner and PyPDF, but they didn't seem to work well. Are there any other ways to do this programmatically?

asked Feb 01 '16 at 15:08

Stephen Yuan

*I want to extract specific data from the forms of pdf, such as the loans*. Not informative at all. Please post an example of such a PDF – Yannis P. Feb 01 '16 at 15:14
In short, I need to extract some tables from the PDFs without ruining their formatting, for example by importing them into Excel or Stata. Thanks! @YannisP. – Stephen Yuan Feb 01 '16 at 17:42
One option would be to use Tabula (http://tabula.technology/). Otherwise, Apache Tika can export the PDF to HTML or XML, and from there you can try different hacks – Yannis P. Feb 01 '16 at 18:30
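For the Tabula route suggested in the comment above, a batch run over a folder of reports might look roughly like the sketch below. It assumes the third-party `tabula-py` wrapper (which needs a Java runtime) is installed; the helper names `extract_tables_with_tabula`, `rows_mentioning`, and `collect`, as well as the keyword filter, are hypothetical and only illustrate the idea. Scanned PDFs without a text layer would need OCR first, which this does not cover.

```python
import csv
from pathlib import Path


def extract_tables_with_tabula(pdf_path):
    """Return every table found in the PDF as a list of rows.

    Assumption: the tabula-py package and a Java runtime are available.
    """
    import tabula  # imported lazily so the helpers below work without it

    frames = tabula.read_pdf(str(pdf_path), pages="all", multiple_tables=True)
    tables = []
    for frame in frames:
        header = [str(column) for column in frame.columns]
        body = frame.astype(str).values.tolist()
        tables.append([header] + body)
    return tables


def rows_mentioning(tables, keyword):
    """Keep rows in which any cell contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        row
        for table in tables
        for row in table
        if any(needle in str(cell).lower() for cell in row)
    ]


def collect(pdf_dir, keyword, out_csv, extract=extract_tables_with_tabula):
    """Scan every *.pdf in pdf_dir and write the matching rows to out_csv.

    Returns a list of (file name, error message) pairs for files that failed.
    """
    failed = []
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
            try:
                tables = extract(pdf_path)
            except Exception as error:  # one bad file should not stop the batch
                failed.append((pdf_path.name, str(error)))
                continue
            for row in rows_mentioning(tables, keyword):
                writer.writerow([pdf_path.name, *row])
    return failed
```

Calling something like `collect("reports", "loans", "loans.csv")` (folder and file names are placeholders) would produce a single CSV that opens directly in Excel or Stata, with the source file name in the first column. The `extract` parameter is there so that a different extractor, for example one based on Apache Tika output, can be swapped in without touching the batch loop.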


0 Answers