
I've got to extract specific data from a large number of .pdf files. The first thing I have to do is convert each .pdf to .txt so I can easily find the data I'm interested in. After conversion, the .txt files contain a large number of artefacts (page numbers, hyperlinks to the contents page, footers, headers, etc.). These .pdf files are quite large (every single file is a transcription of about 7-12 hours of people talking), so I simply can't afford to delete these things manually (I've got ~60 .pdf files). My question is: does anyone know a tool that allows automatic deletion of such content?
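In case it's useful context: the headers, footers, and page numbers you describe are usually detectable by simple heuristics, since they repeat on (nearly) every page. Below is a minimal Python sketch of that idea, assuming the conversion was done with a pdftotext-style tool that separates pages with form-feed characters (`\f`); the thresholds and patterns are illustrative assumptions you would tune for your own files:

```python
import re
from collections import Counter

def clean_extracted_text(text, min_repeats=3):
    """Strip common PDF-extraction artefacts from converted text.

    Assumptions (tune for your files):
      - pages are separated by form feeds, as pdftotext emits them
      - a line that is only a number (or "Page N/M") is a page number
      - a line repeated on min_repeats or more pages is a header/footer
    """
    pages = text.split("\f")

    # Count on how many pages each distinct non-empty line appears.
    line_counts = Counter()
    for page in pages:
        for line in set(l.strip() for l in page.splitlines() if l.strip()):
            line_counts[line] += 1

    cleaned_pages = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            s = line.strip()
            if not s:
                kept.append(line)
                continue
            # Bare page numbers like "12" or "Page 3/120".
            if re.fullmatch(r"(page\s+)?\d+(\s*/\s*\d+)?", s, re.IGNORECASE):
                continue
            # Lines repeated across many pages: likely headers/footers.
            if len(pages) > 1 and line_counts[s] >= min_repeats:
                continue
            kept.append(line)
        cleaned_pages.append("\n".join(kept))
    return "\n".join(cleaned_pages)
```

This won't catch everything (e.g. headers that embed the page number, so the line differs on every page), but combined with a batch loop over your ~60 files it removes the bulk of the repetitive noise automatically.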

I'd be glad to hear any suggestion that would improve my work :) thanks!

Gont.M
  • duplicate of http://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf ? – Christian Cerri May 09 '16 at 10:25
  • Definitely helpful, thanks! Dunno why I didn't realize this post earlier. – Gont.M May 09 '16 at 10:51
  • *does someone know a tool which allows automatic deletion of such contents* - Can you describe how those elements can be recognized automatically? Like "footers are always starting at 1 inch above page bottom" etc. – mkl May 09 '16 at 13:09

0 Answers