I have a large number of PDF files on my local filesystem that I use as a documentation base, and I would like to create an index of them. I would like to:
- Parse the contents of the PDF files to get keywords.
- Select the most relevant keywords to make a summary.
- Create static HTML pages for some keywords with entries linked to the appropriate files.
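To make the three steps above concrete, here is a rough sketch of the kind of pipeline I have in mind. It is written in Python purely for illustration and assumes the `pdftotext` command (from poppler-utils) is installed. The minimum word length, the number of keywords kept per file, the output file naming, and all function names are my own placeholders, and "most relevant" is approximated by plain word frequency.

```python
import html
import re
import subprocess
from collections import Counter
from pathlib import Path
from urllib.parse import quote

MIN_WORD_LEN = 5  # assumption: shorter words are treated as noise
TOP_N = 10        # assumption: number of keywords kept per file


def extract_text(pdf_path):
    """Return the text of a PDF via the external `pdftotext` tool."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" writes the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def top_keywords(text, min_len=MIN_WORD_LEN, top_n=TOP_N):
    """Return the top_n most frequent words having at least min_len letters."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    counts = Counter(w for w in words if len(w) >= min_len)
    return [word for word, _ in counts.most_common(top_n)]


def build_index(texts_by_file):
    """Map each keyword to the sorted list of files where it is a top keyword."""
    index = {}
    for filename, text in texts_by_file.items():
        for keyword in top_keywords(text):
            index.setdefault(keyword, []).append(filename)
    return {kw: sorted(files) for kw, files in sorted(index.items())}


def render_page(keyword, files):
    """Render one static HTML page with a link to every file for a keyword."""
    items = "\n".join(
        f'<li><a href="{quote(name)}">{html.escape(name)}</a></li>'
        for name in files
    )
    return (
        f"<html><body><h1>{html.escape(keyword)}</h1>\n"
        f"<ul>\n{items}\n</ul></body></html>"
    )


def main(pdf_dir):
    """Index every PDF in pdf_dir and write one HTML page per keyword there."""
    pdf_dir = Path(pdf_dir)
    texts = {p.name: extract_text(p) for p in pdf_dir.glob("*.pdf")}
    for keyword, files in build_index(texts).items():
        # Pages sit next to the PDFs so the relative links resolve.
        page = pdf_dir / f"keyword-{keyword}.html"
        page.write_text(render_page(keyword, files), encoding="utf-8")
```

Calling `main("/path/to/pdfs")` (a placeholder path) would write one `keyword-<word>.html` page per keyword next to the PDFs. A real version would probably also need a stop-word list and a better relevance measure than raw counts.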
My questions are:
- Is there an existing tool that does the whole job?
- What is the most appropriate tool for parsing the content of PDF files, filtering the words (by length), and counting them?
- I am considering using Perl, swish-e, and pdfgrep to write a script. Do you know of other tools that could be useful?