How can I check in MySQL whether a PDF already exists or is 80% the same as an existing one?
Users upload PDFs, and the problem is re-uploads. My idea: convert the PDF to binary, which gives a string "X" (the binary of that PDF) to store in MySQL, then run a SELECT with LIKE '%...%' on the slice of X from 1/3 to 2/3 of its length. Would that work? I'm using Laravel. Thanks for reading.
-
So you are asking how to check whether two PDF files are at least 80% similar? I think the best method is to generate a checksum from 80% of the file and compare those – Raymond Nijland Jun 08 '18 at 13:14
-
What does "same" mean? Same text? Same binary data? If I take 80% of a PDF and replace the rest with zeroes, it will probably not be readable at all. If I take 80% of the text from a PDF and copy it into a different document and save it as PDF, it will have a negligible binary overlap with the original (because PDF files typically use zip-encoded chunks, which will vary widely even with substantial duplicate contents) – tucuxi Jun 08 '18 at 13:35
-
Finally, two PDFs which render the exact same pixels may be almost completely different, because PDF is a representation format, and you can replace all text with images of its characters and visually, nobody will know the difference. – tucuxi Jun 08 '18 at 13:38
-
MySQL is almost certainly the wrong tool for this. I can give you 2 PDFs with the *exact same* content but different file data just by rearranging the table of contents. – DavidW Jun 08 '18 at 15:56
1 Answer
This cannot reasonably be done in MySQL. Since you are also using a PHP environment, it may be possible to do it in PHP, but a general solution will take substantial effort.
PDF files are composed of (possibly compressed) streams of images and text. Several libraries can attempt to extract the text, and will work reasonably well if the PDF was generated in a straightforward way; however, they will typically fail if some text was rendered as images of its characters, or if other obfuscation has been applied. In those cases, you will need to use OCR to generate the actual text as it is seen when the PDF is displayed. Note also that tables and images are out of scope for these tools.
Once you have two text files, finding overlaps becomes much easier, although there are several techniques. "Same 80%" can be interpreted in several ways, but let us assume that copying a contiguous 79% of the text from a file and saving it again should not trigger alarms, while copying 81% of that same text should trigger them. Any diff tool can provide information on duplicate chunks, and may be enough for your purposes. A more sophisticated approach, which however does not provide exact percentages, is to use the normalized compression distance.
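As a rough illustration of both approaches, here is a minimal Python sketch that assumes the text has already been extracted from each PDF. The function names `overlap_fraction` and `ncd` are hypothetical, the diff is taken over whitespace-separated words using the standard `difflib` module, and `zlib` stands in for the compressor; any threshold such as 0.8 is something you would have to tune yourself.

```python
import difflib
import zlib


def overlap_fraction(original: str, candidate: str) -> float:
    """Fraction of the words in `original` that fall inside blocks
    also present, in the same order, in `candidate` (0.0 to 1.0)."""
    a = original.split()
    b = candidate.split()
    if not a:
        return 0.0
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent items, which would distort results on long documents.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(a)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for very similar inputs,
    near 1 for unrelated ones. Not an exact percentage of overlap."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    stored = "one two three four five six seven eight nine ten"
    upload = "one two three four five six seven eight"
    print(overlap_fraction(stored, upload))
    print(ncd(stored.encode("utf-8"), upload.encode("utf-8")))
```

Word-level matching is a design choice made here to keep the comparison cheaper than a character-level diff; short common words can still produce small spurious matches, so the fraction is an estimate. Note also that zlib only looks back over a limited window (32 KB), so for long documents a compressor with a larger window may give more meaningful NCD values.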
