Parse PDF in Node.js

Question

I am using meteor-react to upload PDF documents to my Node.js backend, where I want to read the uploaded PDF as JSON or some other format. Is that possible? And what library/tool would you recommend for that? Thank you!

score 13 · Accepted Answer · answered Jan 03 '18 at 19:17


There are a couple of Node packages for parsing PDF:

pdf2json: https://www.npmjs.com/package/pdf2json
pdfreader: https://www.npmjs.com/package/pdfreader

Check out their GitHub and documentation pages. It appears to me that pdf2json is the more complete solution, while pdfreader might be easier to get started with. You'll have to experiment and choose based on your project requirements.
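As a rough illustration of the pdf2json route, here is a minimal sketch. It assumes the package is installed (`npm install pdf2json`) and uses the `pdfParser_dataReady` / `pdfParser_dataError` events and `parseBuffer` described in its README. The helper names `extractPageTexts` and `parsePdfBuffer` are made up for this example, and the output shape (`Pages` → `Texts` → `R` → `T`) may differ between package versions, so check it against the version you install.

```javascript
// Sketch only: assumes `npm install pdf2json`; helper names are hypothetical.

// Collect the text of every page from pdf2json's parsed output.
// pdf2json URI-encodes each text run (`T`), so it is decoded here.
function extractPageTexts(pdfData) {
  // Assumption: newer releases expose `Pages` at the top level, while
  // older ones nest it under `formImage`. Verify for your version.
  const pages =
    pdfData.Pages || (pdfData.formImage && pdfData.formImage.Pages) || [];
  return pages.map((page) =>
    (page.Texts || [])
      .map((text) =>
        (text.R || []).map((run) => decodeURIComponent(run.T)).join("")
      )
      .join(" ")
  );
}

// Parse a PDF held in memory, e.g. the Buffer received by an upload handler.
function parsePdfBuffer(buffer) {
  // Required lazily so extractPageTexts works without the package installed.
  const PDFParser = require("pdf2json");
  return new Promise((resolve, reject) => {
    const parser = new PDFParser();
    parser.on("pdfParser_dataError", (err) => reject(err.parserError || err));
    parser.on("pdfParser_dataReady", (pdfData) =>
      resolve(extractPageTexts(pdfData))
    );
    parser.parseBuffer(buffer);
  });
}
```

With a local file (the name `sample.pdf` is only a placeholder), this could be called as `parsePdfBuffer(require("fs").readFileSync("sample.pdf")).then(console.log)`, which would log one string per page.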


Arash Motamedi



I used the pdf2json package, and it was totally easy to get the PDF fields and their values out. Thanks for the recommendation. – peter Jan 04 '18 at 14:59

The only problem: the PDF parser worked locally, but when we pushed to our test server we got an error like: parserError: "An error occurred while parsing the PDF: InvalidPDFException". It's really hard to find out what the problem is :/ – peter Jan 12 '18 at 18:09
That's unfortunate. I suspect the library depends on an external library being installed on the machine. A couple of ideas come to mind for narrowing down the problem: 1. Can you create a VM on your local machine with specs similar to your server's (mostly the OS) and try running your code there? 2. Can you prototype a quick sample app using the other library and see whether that one works when deployed to your server? – Arash Motamedi Jan 12 '18 at 18:17
And finally, if none of those work, can you please give us the complete error message (including the stack trace)? Maybe there's a hint there that we can track down in the library's source code. – Arash Motamedi Jan 12 '18 at 18:19
I also thought of the second option, but let me try the first one. Here is the error message: https://stackoverflow.com/questions/48230265/pdf2json-invalidpdfexception-in-production-build – peter Jan 12 '18 at 18:33
One more idea that comes to mind: can you try opening a local file (instead of pulling from a CDN URL) and see if that makes any difference? – Arash Motamedi Jan 12 '18 at 21:52
But pdf2json still does not distinguish between normal text and heading-level text in a PDF exported from MS Word. – Onk_r Jan 17 '19 at 07:01
I have tried the packages mentioned above and some others. I want to get key-value pairs like Quotation number: 10470290, Quotation date: 02/06/2021, etc., but these packages shuffle the text. I even tried https://www.npmjs.com/package/pdf2html, still no luck. Any ideas? – Ujjual Jun 03 '21 at 06:01

pdf2json is tragically bad code. You should avoid it. Unfortunately pdfreader depends on pdf2json too. Skip all that nonsense and use Mozilla's pdf.js – ibash Dec 22 '22 at 23:08
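
For the shuffled-text problem described in the comments above, one possible approach is to rebuild lines from the position of each text item before looking for "Key: value" pairs. The sketch below assumes items shaped like those returned by pdf.js's `page.getTextContent()`, where each item has a `str` and a `transform` array whose entries 4 and 5 hold the x and y position. The function names and the `tolerance` parameter are made up for this example, and real documents may need a different tolerance or extra handling for multi-column layouts.

```javascript
// Sketch only: `items` is assumed to be the `items` array from pdf.js's
// page.getTextContent(); function names and `tolerance` are hypothetical.

// Rebuild reading-order lines from positioned text items.
function groupIntoLines(items, tolerance = 2) {
  // PDF coordinates start at the bottom-left, so a larger y is higher up.
  const byY = [...items].sort((a, b) => b.transform[5] - a.transform[5]);
  const lines = [];
  for (const item of byY) {
    const last = lines[lines.length - 1];
    // Items whose y is within `tolerance` of the line's first item share a line.
    if (last && Math.abs(last.y - item.transform[5]) <= tolerance) {
      last.items.push(item);
    } else {
      lines.push({ y: item.transform[5], items: [item] });
    }
  }
  // Within each line, order items left to right and join their text.
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.transform[4] - b.transform[4])
      .map((item) => item.str)
      .join(" ")
  );
}

// Pull "Key: value" pairs out of the rebuilt lines (split at the first colon).
function extractPairs(lines) {
  const pairs = {};
  for (const line of lines) {
    const match = /^\s*([^:]+?)\s*:\s*(.+?)\s*$/.exec(line);
    if (match) pairs[match[1]] = match[2];
  }
  return pairs;
}
```

Grouping by y first and sorting by x only within a line keeps small baseline differences from reordering items that belong to the same row.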
