0

I have about 10,000 PDF files (conference papers), and I need to extract the text from a certain section of each paper (for example, the experimental section) and save it to a file. Does anyone know of a Java or Python tool that can help me do this?

Thanks in advance

Ayush

ayush singhal
  • I am not sure about extracting just a specific part of a PDF, but for the whole text you can check my answer to this question, which is much simpler than the other methods: http://stackoverflow.com/questions/15583535/how-to-extract-text-from-a-pdf-file-in-python/15588435#15588435 – Moj Apr 22 '13 at 19:57

3 Answers

2

Did you research your question before posting? I just googled and found this Apache project: http://pdfbox.apache.org/

eldris
1

For Java, have a look at iText.

For Python, I would use PDFMiner.
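As a rough sketch of how the section-only extraction could look with PDFMiner: extract the full text first, then cut out the part between two headings. This assumes the pdfminer.six fork (its `extract_text` helper in `pdfminer.high_level`) and assumes that section headings end up on their own lines in the extracted text. The heading patterns and the function names `extract_section` and `extract_section_from_pdf` are made up for illustration and would need tuning for real papers.

```python
import re

# Hypothetical heading patterns (an assumption, not taken from any library):
# an optional section number followed by a typical heading on its own line.
SECTION_START = re.compile(
    r"^\s*(?:\d+\.?\s+)?(?:Experiments?|Experimental\s+\w+|Evaluation)\s*$",
    re.IGNORECASE | re.MULTILINE,
)
SECTION_END = re.compile(
    r"^\s*(?:\d+\.?\s+)?"
    r"(?:Conclusions?|Discussion|Related\s+Work|References|Acknowledge?ments?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_section(text):
    """Return the text between the first start heading and the next end heading.

    Returns None when no start heading is found. If no end heading follows,
    everything up to the end of the text is returned.
    """
    start = SECTION_START.search(text)
    if start is None:
        return None
    end = SECTION_END.search(text, start.end())
    body = text[start.end():end.start()] if end else text[start.end():]
    return body.strip()


def extract_section_from_pdf(pdf_path):
    """Extract the whole text of a PDF with pdfminer.six, then cut out the section."""
    # Imported here so the text-only helper works without pdfminer installed.
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    return extract_section(extract_text(pdf_path))
```

For 10,000 files you would loop over the directory, call `extract_section_from_pdf` on each path, and write out the non-`None` results. Two-column layouts and unusual heading styles can easily break the heading match, so it is worth spot-checking the output on a sample first.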

Johnny
  • Do you know the function in PDFBox that lets me extract text from only a certain section of a research article, rather than the whole text? – ayush singhal Apr 22 '13 at 19:51
0

Since these are academic papers, you should also take a look at LA-PDFText (lapdftext).

LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance where needed). The system is open-source and provides a simple baseline function for extracting text from primary research articles using rules that developers can customize.

vortek