
I am developing a front-end application using JavaScript / HTML / CSS. This application allows the user to upload PDF files. I am looking for a way, using JavaScript, to determine whether a PDF file is native or scanned.

A native PDF is a PDF of a document that was “born digital” because the PDF was created from an electronic version of a document, rather than from print.

A scanned PDF, by contrast, is a PDF of a print document, such as when you scan in pages from a print journal and then save this file as a PDF. Please only submit native PDFs.

For a native PDF I don't want to allow the upload; for a scanned PDF I do. I found this JavaScript library: https://pdfjs.express/. It may be what I need, but I don't know where to start. I found related questions on Stack Overflow, but none with JavaScript code.

  • What's the **exact** difference between "native" and "auto-generated"? I would assume that all PDF files are generated through some kind of software. – Nico Haase Jul 09 '21 at 12:59
  • When a PDF file is not Digital Native. So, in this case, I suppose the PDF is not auto-generated. – stackismylifedontyouforget Jul 09 '21 at 13:04
  • To know more: https://support.publishers.jstor.org/hc/en-us/articles/360042578374-What-is-a-born-digital-or-native-PDF- – stackismylifedontyouforget Jul 09 '21 at 13:05
  • That is just the real-life definition of those terms, but those don’t mean anything to a computer. You will have to find _technical properties_ of those PDFs (if they actually exist) that will help you tell the two apart. – CBroe Jul 09 '21 at 13:10
  • "Rather than from print" sounds strange, as there are programs that can be used to add PDF generation capabilities through a virtual printer. Maybe you could distinguish this by checking whether there are only pages consisting of images (which would be a good indicator for that "scanned PDF" category), or if you can extract text from the PDF – Nico Haase Jul 09 '21 at 13:11
  • This is a very useful question: https://stackoverflow.com/questions/63494812/how-can-i-distinguish-a-digitally-created-pdf-from-a-searchable-pdf but it is in Python. – stackismylifedontyouforget Jul 09 '21 at 13:16

1 Answer


A "native PDF" will nearly always contain a /Font object.

A "scanned PDF" will probably not.

This should work in the vast majority of cases:

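// `url` is the address of the PDF to inspect.
fetch(url)
  .then(response => response.text()) // read the raw file contents as text
  .then(data => {
    // A /Font entry in the raw file suggests the PDF carries real text.
    if (/\/Font/.test(data)) {
      console.log('Probably native');
    } else {
      console.log('Probably scanned');
    }
  })
  .catch(error => console.error('Could not read the PDF:', error));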
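
The same check can be factored into a small function that also admits when it cannot tell. This is only a sketch: the name `classifyPdfText` and the three labels are made up for illustration. It assumes the file has already been read into a string, and it returns 'unknown' when the file uses compressed object streams (`/ObjStm`, PDF 1.5 and later), because a font entry inside such a stream would not be visible to a plain-text search.

```javascript
// Classify the raw text of a PDF as 'native', 'scanned', or 'unknown'.
// Hypothetical helper: a heuristic, not a real PDF parser.
function classifyPdfText(text) {
  // A PDF normally announces itself with a "%PDF-" header near the start.
  if (!text.slice(0, 1024).includes('%PDF-')) return 'unknown';

  // A /Font entry visible in the raw file means a font is declared.
  if (/\/Font/.test(text)) return 'native';

  // Object streams can hide dictionaries from a plain-text search,
  // so a missing /Font proves nothing in that case.
  if (/\/ObjStm/.test(text)) return 'unknown';

  return 'scanned';
}
```

Treating 'unknown' as a separate outcome lets the application decide whether to reject the file, accept it, or fall back to a slower check.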

In response to your comments:

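To make this more accurate would require parsing the entire file, which is non-trivial since PDF objects are often compressed (usually with Flate, occasionally LZW in older files), so a /Font entry may not be visible in the raw bytes. Reference. Also, a PDF can mix scanned pages with regular text, and a scanned PDF that has been run through OCR carries a text layer with its own fonts. So there's no way to make this 100% accurate.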
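
If a full parser is available, the check can be done page by page instead. The sketch below assumes the document has been opened with pdf.js (for example `await pdfjsLib.getDocument({ data: bytes }).promise`) and only relies on `numPages`, `getPage()` and `getTextContent()`. The function names `textPageRatio` and `labelFromRatio` are hypothetical. Note that an OCR-processed scan will still be reported as having text.

```javascript
// `pdf` is assumed to behave like a pdf.js document object
// (numPages, getPage(n), page.getTextContent()).
async function textPageRatio(pdf) {
  let pagesWithText = 0;
  for (let n = 1; n <= pdf.numPages; n++) { // pdf.js page numbers are 1-based
    const page = await pdf.getPage(n);
    const content = await page.getTextContent();
    // A page counts as "text" if any extracted item has non-blank content.
    const hasText = content.items.some(
      item => typeof item.str === 'string' && item.str.trim() !== ''
    );
    if (hasText) pagesWithText++;
  }
  return pdf.numPages === 0 ? 0 : pagesWithText / pdf.numPages;
}

// Turn the ratio into a label; the thresholds are an illustrative choice.
function labelFromRatio(ratio) {
  if (ratio === 1) return 'native';
  if (ratio === 0) return 'scanned';
  return 'mixed';
}
```

A caller would combine the two, for example `labelFromRatio(await textPageRatio(pdf))`, and decide separately how to treat 'mixed' files.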

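For security reasons, JavaScript in the browser cannot read arbitrary local files by path. It can, however, read a file that the user explicitly selects through an <input type="file"> element or by drag and drop. Alternatively, if you're running a server, the user could upload their file, and the server could parse it using Node.js.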
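
A file that the user picks in a file input is exposed to the page as a `File` object, which can be read without a server round trip. The following is a minimal sketch of the same /Font check applied to such a file. The function name `isProbablyNative` and the element id `#pdf-input` are assumptions for illustration, not part of the original code.

```javascript
// Reads a Blob or File (as produced by <input type="file">) and applies
// the /Font heuristic. Resolves to true when a /Font entry is found.
async function isProbablyNative(file) {
  const text = await file.text(); // Blob.text() decodes the bytes as UTF-8
  return /\/Font/.test(text);
}

// Browser wiring; '#pdf-input' is a hypothetical element id.
if (typeof document !== 'undefined') {
  const input = document.querySelector('#pdf-input');
  if (input) {
    input.addEventListener('change', async () => {
      const file = input.files[0];
      if (!file) return; // the user cancelled the dialog
      const native = await isProbablyNative(file);
      console.log(native ? 'Probably native' : 'Probably scanned');
    });
  }
}
```

The matching markup would be something like `<input type="file" id="pdf-input" accept="application/pdf">`. The upload itself can then be allowed or blocked based on the result.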

Rick Hitchcock
  • What is /Font, and why would a "scanned" PDF not have it? – Itération 122442 Jul 09 '21 at 13:52
  • A /Font object is required in any PDF file that contains text. Reference: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf – Rick Hitchcock Jul 09 '21 at 13:55
  • If a PDF is scanned by a free app, is it possible that the app inserts FETCH into the file and makes this script fail? How do you think you can do a more accurate analysis? – stackismylifedontyouforget Jul 09 '21 at 15:24
  • The presence of the string "fetch" within a PDF file would not affect this code. A more accurate analysis would require a full parsing of the file, which would require a lot more code. – Rick Hitchcock Jul 09 '21 at 15:31
  • How do you make this analysis more accurate? What do I need to know to be able to do it myself? Also, can you modify the code in your post so that it reads PDFs from the local machine rather than from a remote URL? – stackismylifedontyouforget Jul 09 '21 at 15:44