Index PDF files and generate keywords summary

Question

I have a large number of PDF files on my local filesystem that I use as a documentation base, and I would like to create an index of them. I would like to:

1. Parse the contents of the PDF files to get keywords.
2. Select the most relevant keywords to make a summary.
3. Create static HTML pages for some keywords, with entries linked to the appropriate files.

My questions are:

Is there an existing tool that does the whole job?
What is the most appropriate tool to parse the content of PDF files, filter the words (by length), and count them?
I am considering using Perl, swish-e, and pdfgrep to write a script. Do you know of other tools that could be useful?

Have a look at [recoll](https://www.lesbonscomptes.com/recoll/features.html) — John1024, Aug 18 '16 at 21:24

score 2 · Accepted Answer · edited May 23 '17 at 11:53

Given that points 2 and 3 seem custom, I'd recommend writing your own script: call a tool from it to parse the PDF, process its output as you please, and write the HTML (perhaps using another tool).

Perl is well suited for that, since it excels at the kind of text processing you'll need and also supports working with all kinds of file formats via modules.

As for reading PDF, here are some options if your needs aren't too elaborate:

Use the CAM::PDF (and CAM::PDF::PageText) or PDF::API2 modules
Use pdftotext from the poppler library (probably in the poppler-utils package)
Use pdftohtml with the -xml option, then read the generated simple XML file with XML::LibXML or XML::Twig

The last two are external tools, which you run from your script via Perl's builtins such as system.
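As a rough illustration of the external-tool route, here is a minimal sketch in Python (the same pattern applies from Perl). It assumes pdftotext from poppler-utils is installed and on the PATH; the helper names `pdftotext_command` and `extract_text` are made up for this example.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the command line; the trailing "-" sends the text to stdout."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext (assumed to be installed) and return the extracted text."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

Calling `extract_text("manual.pdf")` would return the whole document as one string, ready for the word processing step. The `-layout` flag is optional; it asks pdftotext to preserve the physical layout of the page.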

The text processing that follows, building your summary and designing the output, is precisely what languages like Perl are for. The couple of tasks you mention (filtering words by size and counting them) take only a few lines of code.
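To give an idea of how little code the filtering and counting takes, here is a small sketch in Python (the logic carries over to a Perl hash almost line for line). The function name `count_keywords`, the minimum length of 5, and the sample text are assumptions for illustration; real documents would probably also need a stop-word list.

```python
import re
from collections import Counter


def count_keywords(text, min_length=5, top=10):
    """Return the `top` most frequent words with at least `min_length` letters."""
    # Lowercase first so "Index" and "index" count as the same word.
    # The pattern only matches ASCII letters; widen it for accented text.
    words = re.findall(r"[a-z]+", text.lower())
    long_words = (w for w in words if len(w) >= min_length)
    return Counter(long_words).most_common(top)


sample = "Index the files. Index keywords from the files, then index again."
print(count_keywords(sample, min_length=5, top=2))
# prints [('index', 3), ('files', 2)]
```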

Then write out the HTML, either directly if it is simple or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
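For the "directly if it is simple" case, a static page per keyword can be produced with plain string templating. The sketch below uses Python's `string.Template` as a stand-in for a templating module; the page layout, the `keyword_page` name, and the (path, count) input shape are all assumptions for this example.

```python
import html
from string import Template
from urllib.parse import quote

PAGE = Template("""<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>$keyword</title></head>
<body>
<h1>$keyword</h1>
<ul>
$entries
</ul>
</body>
</html>
""")


def keyword_page(keyword, files):
    """Render one static page that links a keyword to (path, count) pairs."""
    entries = "\n".join(
        '<li><a href="{}">{}</a> ({})</li>'.format(
            html.escape(quote(path)),  # URL-encode the link target
            html.escape(path),         # escape the visible link text
            count,
        )
        for path, count in files
    )
    return PAGE.substitute(keyword=html.escape(keyword), entries=entries)
```

Writing the result of `keyword_page("index", [("docs/a b.pdf", 3)])` to a file such as `index.html` gives a page whose single entry links to `docs/a%20b.pdf`.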

Full parsing of PDF may be infeasible, but if the files aren't too complex it should work.

If your process for selecting keywords and building statistics is fairly common, there are integrated tools for document management (search for bibliography managers). However, I think most of them rely on external tools to parse PDF, so you may still be better off with your own script.

@JeanJouX Let me know if more specifics would be useful. For example, I can post (a few lines of) example code that would generate a list of words, filter and count them. — zdim, Aug 19 '16 at 20:03
