
I have downloaded more than 2000 PDFs, which are annual reports of companies. I want to extract specific data from the forms in the PDFs, such as the loans. I have tried pdfMiner and pypdf, but they didn't seem to work well. Is there any other way to do this programmatically?

  • *I want to extract specific data from the forms of pdf, such as the loans*. Not informative at all. Please post an example of such a PDF. – Yannis P. Feb 01 '16 at 15:14
  • In short, I need to extract some tables from the PDFs without ruining their formatting, so that I can import them into Excel or Stata. Thanks! @YannisP. – Stephen Yuan Feb 01 '16 at 17:42
  • One option would be to use Tabula [http://tabula.technology/]. Otherwise, Apache Tika can export the PDF to HTML or XML, and from there you can try different hacks to pull out the tables. – Yannis P. Feb 01 '16 at 18:30
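
For anyone trying the Tabula suggestion on a large batch, here is a minimal sketch of how it might look in Python. It assumes the third-party tabula-py wrapper (which needs a Java runtime) is installed. The function names `tabula_extract` and `export_tables`, the folder layout, and the CSV naming scheme are all made up for illustration. Table detection quality will vary from report to report, so treat the output as a starting point rather than clean data.

```python
import csv
from pathlib import Path


def tabula_extract(pdf_path):
    """Return the tables found in one PDF, each as a list of rows.

    Assumption: tabula-py and a Java runtime are installed.
    """
    import tabula  # imported lazily so the batch logic runs without it

    # read_pdf returns a list of pandas DataFrames when multiple_tables=True
    frames = tabula.read_pdf(str(pdf_path), pages="all", multiple_tables=True)
    # Put the header row first, then the data rows
    return [[list(df.columns)] + df.values.tolist() for df in frames]


def export_tables(pdf_dir, out_dir, extract=tabula_extract):
    """Write every table of every PDF in pdf_dir to its own CSV in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            tables = extract(pdf_path)
        except Exception as exc:  # one bad report should not stop the batch
            failed.append((pdf_path.name, str(exc)))
            continue
        for i, rows in enumerate(tables, start=1):
            # Hypothetical naming scheme: <report>_table<N>.csv
            csv_path = out_dir / f"{pdf_path.stem}_table{i}.csv"
            with open(csv_path, "w", newline="", encoding="utf-8") as fh:
                csv.writer(fh).writerows(rows)
            written.append(csv_path)
    return written, failed
```

Calling `export_tables("reports", "tables_csv")` (both folder names hypothetical) would leave one CSV per detected table, which Excel or Stata can import. The returned `failed` list records which reports could not be parsed, so they can be checked by hand or retried with another tool such as Apache Tika.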
