I need to extract some specific data from a large number of .pdf files. The first step is converting each .pdf to .txt so I can easily find the data I'm interested in. After conversion, the .txt files contain a lot of artifacts (page numbers, hyperlinks to the contents page, footers, headers, etc.). The .pdf files are quite large (each one is a transcript of about 7-12 hours of people talking), and I have ~60 of them, so I simply cannot afford to delete these things manually. My question: does anyone know a tool that can remove this kind of content automatically?
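To make the goal concrete, here is a rough Python sketch of the kind of cleanup I have in mind. It is only an illustration under a few assumptions, not a finished tool: it assumes the converted text separates pages with a form-feed character (`\f`), that headers and footers are lines repeated verbatim on many pages, and that page numbers sit alone on a line. The names `strip_artifacts`, `PAGE_NUMBER` and `min_share` are made up for this example.

```python
import re
from collections import Counter

# Assumption: a page number is a line containing only "12", "Page 12",
# "12/340" or "12 of 340". Adjust the pattern to the real files.
PAGE_NUMBER = re.compile(
    r"^\s*(?:page\s+)?\d+(?:\s*(?:/|of)\s*\d+)?\s*$", re.IGNORECASE
)


def strip_artifacts(text, min_share=0.5):
    """Drop page-number lines and lines repeated on many pages.

    Assumes pages are separated by form feeds ("\f"). A line counts as a
    header/footer if it occurs on at least `min_share` of the pages
    (and on at least two pages).
    """
    pages = [p.splitlines() for p in text.split("\f") if p.strip()]

    # Count on how many pages each distinct non-empty line appears.
    counts = Counter()
    for lines in pages:
        counts.update({ln.strip() for ln in lines if ln.strip()})

    threshold = max(2, int(len(pages) * min_share))
    repeated = {ln for ln, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        for ln in lines:
            stripped = ln.strip()
            if stripped in repeated or PAGE_NUMBER.match(stripped):
                continue  # header, footer, or page number
            cleaned.append(ln)
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = (
        "Meeting transcript 2019\nSPEAKER A: Hello.\nPage 1\n\f"
        "Meeting transcript 2019\nSPEAKER B: Hi there.\nPage 2\n\f"
        "Meeting transcript 2019\nSPEAKER A: Bye.\n3\n"
    )
    print(strip_artifacts(sample))
```

The obvious weakness is false positives: a genuine transcript line that consists only of a number, or a short remark that happens to recur on most pages, would be removed too, so the threshold and the pattern would need tuning against a few real files.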
I'd be glad to hear any suggestion that would improve my workflow :) Thanks!