I am trying to extract plain text from a PDF document using pdf.js, but I cannot get past the `Invalid PDF structure` error.

My code is as follows:

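const pdfjslib = require('pdfjs-dist');

// Remote PDF that triggers the error
const pdfPath = 'https://www.corenet.gov.sg/media/2268607/dc19-07.pdf';

const loadingTask = pdfjslib.getDocument(pdfPath);
loadingTask.promise
    .then((doc) => {
        // Never reached for this URL: loading rejects first
        console.log(doc);
        return null;
    })
    .catch((err) => {
        // Logs the InvalidPDFException shown below
        console.log(err);
    });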

I have tried other PDF documents from the same domain, but they all throw the same error:

...
Warning: Ignoring invalid character "34" in hex string
Warning: Ignoring invalid character "104" in hex string
Warning: Indexing all PDF objects
{ Error
    at InvalidPDFExceptionClosure (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:658:35)
    at Object.<anonymous> (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:661:2)
    at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
    at Object.defineProperty.value (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:129:23)
    at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
    at pdfjsVersion (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:116:18)
    at .../pdf_test/node_modules/pdfjs-dist/build/pdf.js:119:10
    at webpackUniversalModuleDefinition (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:25:20)
    at Object.<anonymous> (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:32:3)
    at Module._compile (internal/modules/cjs/loader.js:776:30)
  name: 'InvalidPDFException',
  message: 'Invalid PDF structure' }

PDFs from other domains seem to work. Note that downloading the PDF from the above domain works fine, and the file can be viewed in the Chrome browser, so I doubt that the PDF document is corrupted. I am not implementing any front-end code, as the intention is to host the above code in the cloud.

halfer
Koh
  • This answer has an alternative script for extracting data you could try - https://stackoverflow.com/a/29032269/2570277 – Nick Nov 14 '19 at 11:27
  • Maybe the website you are downloading the PDF from checks the request headers. Can you try downloading the PDF with chrome and load it locally? – Cr4xy Nov 14 '19 at 11:35
  • @Cr4xy yes, I could download the PDF and load it locally. It loads and extracts the plain text correctly. If the request headers are the "issue", any idea how I can get around it without downloading the PDF? – Koh Nov 14 '19 at 11:45
  • You could try to copy all the headers you can find with chrome dev tools, and add them to getDocument like [here](https://github.com/mozilla/pdf.js/issues/3852#issuecomment-373169304) – Cr4xy Nov 14 '19 at 13:02
  • @Cr4xy I have attempted to copy all headers and add them to `httpHeaders`. However, the crucial header to add to make it work seems to be the `cookies` header. Is this expected? – Koh Nov 16 '19 at 07:35
  • "I doubt that the pdf document is corrupted" _don't guess, verify_. So, step 1: grab Adobe Acrobat, the free version, and try to open those PDFs. Then [edit] your post so that it mentions whether, according to the most authoritative PDF reader that exists, these PDF files are broken or not. Because if Acrobat doesn't like them, this isn't a problem with pdfjs in the slightest. – Mike 'Pomax' Kamermans Apr 29 '23 at 16:06
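
The `httpHeaders` suggestion from the comments above could be sketched roughly as follows. This is a minimal sketch, not a confirmed fix: `buildDocumentParams` and `loadRemotePdf` are hypothetical helper names, the cookie value must be copied from a real browser session, and whether the server accepts the request with only a `Cookie` header is an assumption based on the comment thread.

```javascript
// Hypothetical helper: builds the parameter object for pdfjsLib.getDocument().
// `url` and `httpHeaders` are documented getDocument() parameters; the cookie
// string itself is a placeholder supplied by the caller.
function buildDocumentParams(url, cookie) {
  const params = { url };
  if (cookie) {
    // Assumption: the server only serves the PDF when a session cookie is sent.
    params.httpHeaders = { Cookie: cookie };
  }
  return params;
}

// Hypothetical wrapper: loads the document and resolves with its page count.
async function loadRemotePdf(url, cookie) {
  // Required lazily so that the helper above can be used without pdfjs-dist.
  const pdfjsLib = require('pdfjs-dist');
  const doc = await pdfjsLib.getDocument(buildDocumentParams(url, cookie)).promise;
  return doc.numPages;
}
```

Calling `loadRemotePdf(pdfPath, 'name=value')` with a cookie string copied from the browser's dev tools would then replace the plain `getDocument(pdfPath)` call from the question.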

1 Answer

The errors in the browser console log did not help me fix this.

I run a PHP app (Moodle). In the PHP error log I saw errors about variables that were expected to be replaced within the HTML source body of the certificate being generated.

Check your backend app's error logs, and check the HTML source body provided to PDF.js for missing or undefined variables.

Rebuilding the HTML body provided to PDF.js from scratch can help to track down the source of the exception.
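
One way to check what is actually being handed to PDF.js is to look for the `%PDF-` header marker before parsing. The sketch below assumes the response body is already available as a `Buffer` or `Uint8Array`; `looksLikePdf` is a hypothetical helper name, and the 1024-byte search window is an assumption that mirrors the tolerance many PDF readers apply rather than a strict rule.

```javascript
// Hypothetical helper: returns true if the "%PDF-" header marker appears
// within the first 1024 bytes of the data. An HTML error or login page
// returned in place of the file will normally fail this check.
function looksLikePdf(bytes) {
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  return head.includes('%PDF-');
}
```

If `looksLikePdf(body)` returns `false` for the downloaded bytes, the backend (or the remote server) is probably returning something other than the PDF, and the cause is more likely on the server side than in PDF.js.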

Matteus Barbosa