PDF files and JavaScript: how to check if PDF is native or scanned

Question

I am developing a front-end application using JavaScript / HTML / CSS. This application allows the user to upload PDF files. I am looking for a way, using JavaScript, to determine whether a PDF file is native or scanned.

A native PDF is a PDF of a document that was “born digital” because the PDF was created from an electronic version of a document, rather than from print.

A scanned PDF, by contrast, is a PDF of a print document, such as when you scan in pages from a print journal and then save this file as a PDF. Please only submit native PDFs.

For a native PDF I don't want to allow the upload; for a scanned PDF I do. I found this JavaScript library: https://pdfjs.express/ It may be what I need, but I don't know where to start. I found related questions on Stack Overflow, but none with JavaScript code.

What's the **exact** difference between "native" and "auto-generated"? I would assume that all PDF files are generated through any kind of software — Nico Haase, Jul 09 '21 at 12:59
When a PDF file is not Digital Native. So, in this case, I suppose the PDF is not auto-generated. — stackismylifedontyouforget, Jul 09 '21 at 13:04
To know more: https://support.publishers.jstor.org/hc/en-us/articles/360042578374-What-is-a-born-digital-or-native-PDF- — stackismylifedontyouforget, Jul 09 '21 at 13:05
That is just the real-life definition of those terms, but those don’t mean anything to a computer. You will have to find _technical properties_ of those PDFs (if they actually exist), that will help you to tell them both apart somehow. — CBroe, Jul 09 '21 at 13:10
"Rather than from print" sounds strange, as there are programs that can be used to add PDF generation capabilities through a virtual printer. Maybe you could distinguish this by checking whether there are only pages consisting of images (which would be a good indicator for that "scanned PDF" category), or if you can extract text from the PDF — Nico Haase, Jul 09 '21 at 13:11
This is a very useful question: https://stackoverflow.com/questions/63494812/how-can-i-distinguish-a-digitally-created-pdf-from-a-searchable-pdf but is in Python!! — stackismylifedontyouforget, Jul 09 '21 at 13:16

Rick Hitchcock · Answer 1 · 2021-07-19T13:09:51.623

A "native PDF" will nearly always contain a /Font object.

A "scanned PDF" will probably not, unless it has been run through OCR, which adds a text layer and therefore fonts.

This should work in the vast majority of cases:

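// `url` is the address of the PDF to check.
fetch(url)
  .then(response => response.text())
  .then(data => {
    // Matches any font-related key (/Font, /FontDescriptor, ...).
    if (/\/Font/.test(data)) {
      console.log('Probably native');
    } else {
      console.log('Probably scanned');
    }
  })
  .catch(error => console.error('Could not fetch the PDF:', error));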

In response to your comments:

To make this more accurate would require parsing the entire file, which is non-trivial because PDF streams are usually compressed (most commonly with Flate, sometimes with LZW), and since PDF 1.5 the objects themselves can be stored inside compressed object streams. Also, a PDF can mix scanned pages with regular text. So there's no way to make this 100% accurate.
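One step beyond the plain-text search, still far short of a full parser, is to inflate any Flate-compressed streams and look for "/Font" inside them as well. The following is a rough Node.js sketch under stated assumptions: `hasFontMarker` is a hypothetical name, only the Flate filter is handled (LZW, encryption and filter chains are ignored), and streams are located with a simple regular expression rather than by following the cross-reference table. It remains a heuristic.

```javascript
const zlib = require('zlib');

// Hypothetical helper: true if "/Font" appears in the raw file or inside
// any stream that zlib can inflate (the Flate filter).
function hasFontMarker(pdfBytes) {
  // latin1 maps every byte to exactly one character, so binary data
  // survives the round trip between Buffer and string.
  const raw = Buffer.from(pdfBytes).toString('latin1');
  if (raw.includes('/Font')) return true;

  // Naive stream detection; a real parser would use each stream's /Length.
  const streamRe = /stream\r?\n([\s\S]*?)endstream/g;
  let match;
  while ((match = streamRe.exec(raw)) !== null) {
    try {
      const inflated = zlib.inflateSync(Buffer.from(match[1], 'latin1'));
      if (inflated.toString('latin1').includes('/Font')) return true;
    } catch (err) {
      // Not Flate data (image codec, LZW, encrypted, ...): skip this stream.
    }
  }
  return false;
}
```

On a server this could be called as `hasFontMarker(fs.readFileSync(path))`, where `path` is wherever the upload was stored. An OCR-processed scan would still be reported as containing fonts.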

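JavaScript in the browser cannot read arbitrary local files, since that would be a security risk; it can only read files the user explicitly selects, for example through an <input type="file"> element. If you're running a server, the user could upload their file, and the server could parse it using Node.js.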
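For the local-file case raised in the comments, a file that the user picks in a file input can be read in the browser before any upload happens. Below is a minimal sketch that applies the same "/Font" heuristic to such a file. The helper name `isProbablyNativePdf` and the element id `pdf-input` are assumptions made for illustration, not part of any library.

```javascript
// Hypothetical helper: applies the "/Font" heuristic to a File or Blob.
async function isProbablyNativePdf(file) {
  // Blob.text() decodes the bytes as UTF-8. Binary stream data turns into
  // replacement characters, but the ASCII token "/Font" is preserved.
  const text = await file.text();
  return /\/Font/.test(text);
}

// Browser wiring; assumes <input type="file" id="pdf-input"> exists on the page.
if (typeof document !== 'undefined') {
  const input = document.getElementById('pdf-input');
  if (input) {
    input.addEventListener('change', async () => {
      const file = input.files[0];
      if (!file) return;
      const native = await isProbablyNativePdf(file);
      console.log(native ? 'Probably native' : 'Probably scanned');
    });
  }
}
```

The result could then be used to enable or disable the upload button. Because this runs on the client, a server-side check would still be needed if the rule has to be enforced.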

edited Jul 19 '21 at 13:09

answered Jul 09 '21 at 13:48

What is /Font, and why would a "scanned" PDF not have it? – Itération 122442 Jul 09 '21 at 13:52
A /Font object is required in any PDF file that contains text. Reference: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf – Rick Hitchcock Jul 09 '21 at 13:55
If a PDF is scanned by a free app, is it possible that the app inserts the string FETCH into the file and makes this script fail? How could the analysis be made more accurate? – stackismylifedontyouforget Jul 09 '21 at 15:24
The presence of the string "fetch" within a PDF file would not affect this code. A more accurate analysis would require a full parsing of the file, which would require a lot more code. – Rick Hitchcock Jul 09 '21 at 15:31
How do you make this analysis more accurate? What do I need to know to be able to do it myself? Also, can you modify the code in your post so that it reads PDFs from the local machine rather than from a remote URL? – stackismylifedontyouforget Jul 09 '21 at 15:44
