Automatically extract text from PDF for many files

Question

I have about 10,000 PDF files (conference papers), and I need to extract the text of a particular section (such as the experimental section) from each paper and save it to a file. Does anyone know of a Java or Python tool that can help me do this?

Thanks in advance

Ayush

I am not sure about extracting only a specific part of a PDF, but for the whole text you can check my post on this question, which is much simpler than the other methods: http://stackoverflow.com/questions/15583535/how-to-extract-text-from-a-pdf-file-in-python/15588435#15588435 – Moj, Apr 22 '13 at 19:57

score 2 · Answer 1 · answered Apr 22 '13 at 17:25

Did you research your question before posting? I just googled and found this Apache project: http://pdfbox.apache.org/

eldris

Do you have any other suggestions, for example in Python? – ayush singhal Apr 22 '13 at 17:27
http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text suggests http://www.unixuser.org/~euske/python/pdfminer/index.html – eldris Apr 22 '13 at 17:29
Does PDFBox have a function to extract text from only one section of a paper rather than the entire text of the PDF? – ayush singhal Apr 22 '13 at 19:52

score 1 · Accepted Answer · answered Apr 22 '13 at 17:27

For Java, have a look at iText.

For Python, I would use PDFMiner.
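As far as I know, neither library has a built-in "give me only the experimental section" call, so a common workaround is to extract the full text and then slice it between section headings. Below is a minimal sketch of that idea, not a definitive implementation. It assumes the maintained fork pdfminer.six (whose `pdfminer.high_level.extract_text` helper did not exist in the original PDFMiner), and the heading patterns, `extract_section`, `pdf_to_text`, and `process_folder` are hypothetical names and guesses that would need tuning for a real corpus.

```python
import re
from pathlib import Path

# Assumed heading patterns: an optional section number followed by a known title
# on a line of its own. Real papers vary a lot, so treat these as a starting point.
SECTION_START = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?"
    r"(?:experiments?|experimental (?:results|setup|evaluation)|evaluation)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
NEXT_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?"
    r"(?:conclusions?|discussion|related work|acknowledge?ments?|references)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_section(text):
    """Return the text between the experiments heading and the next known heading."""
    start = SECTION_START.search(text)
    if start is None:
        return None  # no recognizable experiments heading
    end = NEXT_HEADING.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.end():stop].strip()


def pdf_to_text(path):
    """Extract the full text of one PDF (assumes pdfminer.six is installed)."""
    # Imported lazily so the slicing logic can be used and tested without pdfminer.
    from pdfminer.high_level import extract_text
    return extract_text(str(path))


def process_folder(pdf_dir, out_dir):
    """Write one .txt file per PDF in which a section was found."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            section = extract_section(pdf_to_text(pdf))
        except Exception as exc:  # malformed PDFs are common in large batches
            print(f"skipped {pdf.name}: {exc}")
            continue
        if section:
            (out / f"{pdf.stem}.txt").write_text(section, encoding="utf-8")
```

Calling `process_folder("papers", "sections")` (both directory names are placeholders) would walk the folder and skip files that fail to parse. Expect misses on two-column layouts or unusual heading styles, so spot-check the output on a sample before running all 10,000 files.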

Johnny

Do you know of a function in PDFBox that lets me extract text from only a certain section of a research article rather than the whole text? – ayush singhal Apr 22 '13 at 19:51

score 0 · Answer 3 · answered Nov 15 '13 at 02:28

Since these are academic papers, you should also really look at LA-PDFText (lapdftext).

LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance where needed). The system is open-source and provides a simple baseline function for extracting text from primary research articles using rules that developers can customize.
