I am trying to extract plain text from a PDF document using pdf.js,
but I cannot get past the "Invalid PDF structure" error.
My code is as follows:
const pdfjslib = require('pdfjs-dist');
const pdfPath = 'https://www.corenet.gov.sg/media/2268607/dc19-07.pdf';
// Load the document straight from the remote URL.
const loadingTask = pdfjslib.getDocument(pdfPath);
loadingTask.promise
  .then(async (doc) => {
    console.log(doc);
    return null;
  })
  .catch((err) => {
    // This is where the InvalidPDFException is logged.
    console.log(err);
  });
I have tried other PDF documents from the same domain, but they all throw the same error:
...
Warning: Ignoring invalid character "34" in hex string
Warning: Ignoring invalid character "104" in hex string
Warning: Indexing all PDF objects
{ Error
at InvalidPDFExceptionClosure (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:658:35)
at Object.<anonymous> (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:661:2)
at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
at Object.defineProperty.value (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:129:23)
at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
at pdfjsVersion (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:116:18)
at .../pdf_test/node_modules/pdfjs-dist/build/pdf.js:119:10
at webpackUniversalModuleDefinition (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:25:20)
at Object.<anonymous> (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:32:3)
at Module._compile (internal/modules/cjs/loader.js:776:30)
name: 'InvalidPDFException',
message: 'Invalid PDF structure' }
PDFs from other domains seem to work. Note that downloading the PDF from the above domain works fine, and the file can be viewed in Chrome, so I doubt that the document itself is corrupted. I am not writing any front-end code, because the intention is to host the above code in the cloud.
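One way to test the assumption that the file is fine would be to look at the bytes Node actually receives, since they may differ from what the browser gets. The character codes in the warnings (34 is `"` and 104 is `h`) would be consistent with non-PDF content such as an HTML page, although that is only a guess. The sketch below is a diagnostic, not a fix: `fetchBytes` and `looksLikePdf` are hypothetical helper names, and the check assumes that a real PDF carries the `%PDF-` marker within its first 1024 bytes.

```javascript
const https = require('https');

// Hypothetical helper: download a URL into a Buffer using Node's built-in https module.
// Redirects are not followed here; a 3xx status would show up in `status`.
function fetchBytes(url) {
  return new Promise((resolve, reject) => {
    https
      .get(url, (res) => {
        const chunks = [];
        res.on('data', (chunk) => chunks.push(chunk));
        res.on('end', () =>
          resolve({
            status: res.statusCode,
            contentType: res.headers['content-type'],
            body: Buffer.concat(chunks),
          })
        );
        res.on('error', reject);
      })
      .on('error', reject);
  });
}

// Hypothetical helper: true if the "%PDF-" marker appears in the first 1024 bytes.
function looksLikePdf(bytes) {
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  return head.includes('%PDF-');
}

// Example usage (not run automatically; needs network access):
// fetchBytes('https://www.corenet.gov.sg/media/2268607/dc19-07.pdf')
//   .then(({ status, contentType, body }) => {
//     console.log(status, contentType, looksLikePdf(body));
//     // Print the start of the response to see what was really returned.
//     console.log(body.subarray(0, 200).toString('latin1'));
//   })
//   .catch(console.error);
```

If `looksLikePdf` reports true, the same buffer could be handed to pdf.js as `pdfjslib.getDocument({ data: new Uint8Array(body) })` instead of the URL. If it reports false, the status code, content type, and first bytes should indicate what the server is sending to non-browser clients.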