
Maybe this question seems a bit strange, but it has a very practical use case.

Assume that we select an arbitrary section of a PDF file to generate a checksum from, such as the highlighted text in the following screenshot:

[Screenshot: a PDF page with a passage of text highlighted]

We then generate a checksum from the selected text using a hash function. We deliver (not send) the whole PDF file along with this checksum to a receiver, who does NOT know which section of the PDF file was selected and hashed. The receiver wants to verify the checksum, so they need to know exactly which section of the PDF file was selected and hashed. We therefore need a solution that lets the receiver find the exact position of the selected and hashed text.

Since a hash function is not reversible, the question is:

How can the receiver find exactly the selected and hashed text in the PDF file?

For example, is it feasible to determine the exact location of the selected and hashed text in the PDF file? (This is very sensitive, since even a single wrong character or space causes checksum verification to fail.)

Is there a reliable approach for this challenge?

Note 1: If the question is not clear enough, please let me know and I will explain it in more detail.

Important: Please note that because of space limitations, we can only store the checksum value plus a limited amount of data indicating the position of the selected text, meaning that we cannot store the entire selected text.

Use case: we intend to have a verifier check the integrity of selected texts in the document. The checksum, along with information that points to the hashed text, will be stored in the blockchain. Because storing data in the blockchain is costly, we cannot store the entire selected and hashed text there; instead we store only some information that points to the exact position of the selected and hashed text. The verifier has access to the entire document, but does not know which section of the document has been hashed. They need to know this to verify the checksum.

For example, assume a prover has a (paper) certificate and needs to prove that he is the owner of the certificate. He scans the certificate (digitizing it to any format is fine). The issuer of the certificate has selected some sensitive parts of the certificate (e.g. owner info) and hashed each selected section separately to generate checksums. When the prover (owner) delivers the certificate to a verifier, the verifier needs to check all checksums. At this step, he needs to know which parts of the certificate have been hashed. So we need to attach data to the checksums that lets the verifier find the hashed sections.

Please also note that the selected text is not recorded; it is only selected to generate the checksum. However, the verifier needs to know the content of this text to verify the checksum. The problem is that, because of the limitations of storing data in the blockchain, we cannot store the entire hashed text; we can only store some information that points to the exact position of the hashed text.

Update: This question is related to (FREE Tool for watching coordinates in PDF), where a tool makes it possible to find the exact (x,y) coordinates of a selected text. I am not yet sure whether this tool can be used for my question.

Questioner

1 Answer


Note that the PDF file doesn't contain text as such. It contains a tree of objects, some of which are streams that contain a simplified variant of PostScript, with commands that tell the renderer which glyphs to put where (or other commands to render graphical output).

I'd recommend using a tool like mutool from the mupdf package to decompress the streams in a small PDF document, and then opening it in a text editor to see for yourself what it looks like.
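
To give a rough idea of what you will see after decompression, here is a minimal sketch using only Python's `zlib` module. The content stream below is written by hand for illustration (it is an assumption, not taken from a real file), and the sketch only shows the compress/decompress round trip, not how to locate streams inside a PDF.

```python
import zlib

# Hand-written content stream (an assumption for illustration, not from a real file):
# BT/ET bracket a text object, Tf selects font F1 at 12 pt,
# Td moves to position (72, 720), and Tj paints the glyphs for the string.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

# A stream stored with the FlateDecode filter holds zlib-compressed bytes.
compressed = zlib.compress(content)

# Decompressing gives back drawing operators, not a plain "text document".
decoded = zlib.decompress(compressed)
print(decoded.decode("ascii"))
```

In a real document the string passed to `Tj` is often a sequence of font-specific glyph codes rather than readable characters, which is why the renderer needs extra tables to map it back to text.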

So when you select "text" in a renderer, you are hooking into the renderer's process that puts glyphs on the page. The renderer can make some effort to re-translate the glyphs into text, which relies on (1) having tables for that in the PDF, and (2) assumptions about how the application that produced the PDF worked (for example, that it laid out glyphs in the same order as the original text). If you hash this re-translated text, the result will always depend on the method the renderer used to do the re-translation.
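
A small sketch of why this matters for a checksum. The three strings are hypothetical extraction results for the same highlighted passage (they are assumptions made up for this example, not output of any particular viewer); SHA-256 is just one possible choice of hash.

```python
import hashlib

def checksum(text: str) -> str:
    """SHA-256 of the UTF-8 encoding of the extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical extraction results for the same highlighted passage.
viewer_a = "Certificate owner:\nJane Doe"
viewer_b = "Certificate owner: Jane Doe"       # line break returned as a space
viewer_c = "Certi\ufb01cate owner:\nJane Doe"  # "fi" returned as one ligature character

reference = checksum(viewer_a)
for name, text in [("a", viewer_a), ("b", viewer_b), ("c", viewer_c)]:
    # Only an extraction that is byte-for-byte identical matches the reference.
    print(name, checksum(text) == reference)
```

All three strings look the same to a human reader, but only the first one reproduces the reference checksum.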

So your use case (whatever it is good for) will need identical rendering programs for the sender and the receiver.

On the other hand, assuming either embedded fonts or identical fonts, rendering is deterministic (in particular within the same renderer). So the simplest way would be to just record the exact position of your selection on the page, together with the page number, and then send this information.
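
As a sketch of how little data such a record needs, the following packs a page number, a rectangle, and a SHA-256 digest into a fixed-size byte string. The field layout, the `SelectionRecord` name, and the integer coordinate units are all assumptions chosen for illustration; sender and receiver would have to agree on their own convention.

```python
import hashlib
import struct
from dataclasses import dataclass

# Hypothetical layout: page (uint16), four uint32 coordinates, 32-byte SHA-256 digest.
# ">" means big-endian with no padding, so the size is 2 + 4*4 + 32 bytes.
RECORD = struct.Struct(">H4I32s")

@dataclass(frozen=True)
class SelectionRecord:
    page: int
    x0: int
    y0: int
    x1: int
    y1: int
    digest: bytes

    def pack(self) -> bytes:
        """Serialize the record to a fixed-size byte string."""
        return RECORD.pack(self.page, self.x0, self.y0, self.x1, self.y1, self.digest)

    @classmethod
    def unpack(cls, raw: bytes) -> "SelectionRecord":
        """Rebuild a record from its serialized form."""
        return cls(*RECORD.unpack(raw))

# Placeholder input; in practice this is whatever both sides agreed to hash.
digest = hashlib.sha256(b"content of the selected region").digest()
record = SelectionRecord(page=1, x0=7200, y0=65000, x1=30000, y1=68000, digest=digest)

raw = record.pack()
print(len(raw), SelectionRecord.unpack(raw) == record)
```

With this layout, position and checksum together take 50 bytes per selection, regardless of how much text the selection covers.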

Edit

If you are scanning a paper document in the first place, and need to mark several rectangular areas, pick some format for the image, find the exact pixel position of the rectangles, extract the pixels inside the rectangle into some defined format (e.g. RGB 8+8+8), and hash this data. Then transmit the rectangle position together with the hash.
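
A minimal sketch of that procedure, assuming the scan has already been decoded into raw RGB bytes (3 bytes per pixel, rows stored top to bottom). The helper names `crop_rgb` and `region_hash` and the toy 4x4 image are made up for this example; decoding a real image file into such a buffer is left to whatever image library you pick.

```python
import hashlib

def crop_rgb(pixels: bytes, width: int, height: int,
             x: int, y: int, w: int, h: int) -> bytes:
    """Return the raw RGB bytes (3 per pixel, row-major) inside the rectangle."""
    if len(pixels) != width * height * 3:
        raise ValueError("pixel buffer does not match image size")
    if x < 0 or y < 0 or w <= 0 or h <= 0 or x + w > width or y + h > height:
        raise ValueError("rectangle lies outside the image")
    rows = []
    for row in range(y, y + h):
        start = (row * width + x) * 3
        rows.append(pixels[start:start + w * 3])
    return b"".join(rows)

def region_hash(pixels: bytes, width: int, height: int, rect: tuple) -> str:
    """SHA-256 over the pixels of one rectangle; rect is (x, y, w, h)."""
    return hashlib.sha256(crop_rgb(pixels, width, height, *rect)).hexdigest()

# Toy 4x4 "scan" standing in for a real image (assumption for illustration).
width, height = 4, 4
image = bytes(range(width * height * 3))
rect = (1, 1, 2, 2)

# Issuer: hash the region, then publish rect together with the hash.
issued = region_hash(image, width, height, rect)

# Verifier: same image, same rect, so the same hash.
verifier_ok = region_hash(image, width, height, rect) == issued

# Changing one pixel inside the rectangle breaks verification ...
inside = bytearray(image)
inside[(1 * width + 1) * 3] ^= 0xFF
inside_ok = region_hash(bytes(inside), width, height, rect) == issued

# ... while a change outside the rectangle goes unnoticed by this checksum.
outside = bytearray(image)
outside[0] ^= 0xFF
outside_ok = region_hash(bytes(outside), width, height, rect) == issued

print(verifier_ok, inside_ok, outside_ok)
```

The verifier never has to "find" the hashed pixels by searching: the transmitted rectangle tells them exactly which bytes to feed into the hash.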

You can store multiple scanned images in a PDF for convenience, and then extract them from the PDF with a number of tools, but it doesn't really matter how you store the images, as long as you agree on some format (because with lossy compression it may change the pixel values).

This will require you to archive the scanned images (as PDF, or any other form).

dirkt
  • Thank you for your useful information. In general, is there **any other file format** (instead of PDF) that would make this process easier? For example, one where we can send the receiver the exact position of the hashed text, so that the receiver is able to find the exact selected text. Thanks – Questioner Sep 24 '18 at 11:13
  • The simplest such format would be plain ASCII or UTF-8 formatted text. If you want recommendations, I'd suggest you edit your question and explain your use case in detail (what exactly is your final goal? What do you want to achieve, and why?). – dirkt Sep 24 '18 at 11:22
  • It's added. If you need more details, please let me know. Thanks – Questioner Sep 24 '18 at 11:31
  • Sorry, but "we intent to verify the integrity of selected texts" doesn't really help. What kind of documents are these? Are they PDFs by necessity? Do they have to be formatted, or is it just the information that counts? Is changing the PDF to highlight the selection (and then using a cryptographic hash on the whole changed PDF) an option? Who is the audience, on either side? Can you make assumptions on which software is used, on either side? What kind of program is going to record the selections, and make the hash? Etc., pp. – dirkt Sep 24 '18 at 11:40
  • For example, assume a prover has a (paper) certificate and needs to prove that he is the owner of the certificate. He scans the certificate (digitizing it to any format is fine). The issuer of the certificate has selected some **sensitive** parts of the certificate (e.g. owner info) and hashed each selected section separately to generate checksums. When the prover (owner) delivers the certificate to a verifier, the verifier needs to check all checksums. At this step, he needs to know which parts of the certificate have been hashed. So we need to attach data to the checksums that lets the verifier find the hashed sections. What do you think? – Questioner Sep 24 '18 at 11:59
  • Btw, the selected text is not recorded; it is only selected to generate the checksum. However, the verifier needs to know the content of this text to verify the checksum. The problem is that, because of the limitations of storing data in the blockchain, we cannot store the entire hashed text; we can only store some information that points to the exact position of the hashed text. – Questioner Sep 24 '18 at 12:05
  • Thank you, but how do we **find the exact pixel position of the rectangles, extract the pixels inside the rectangle into some defined format**? Are there any tools to do this? And eventually, in this case, is the verifier able to find the **selected hashed text**? Please note that the verifier cannot see which part of the document was selected for hashing; the verifier can only see the **original** and **virgin** document. Thanks – Questioner Sep 24 '18 at 13:48
  • You are writing some sort of program to do this anyway, aren't you? Pick any programming language, any image viewing library, and a GUI to select rectangles and extract a rectangular image; that is about a man-day of work or so. There are also command line tools like ImageMagick or GraphicsMagick that can extract rectangles. You are not hashing the **text**, you are hashing the **pixels** in the rectangle, because converting an image to text is hard. Both the verifier and the signer will hash the exact same rectangular image region, provided they both have access to the same scanned image. – dirkt Sep 24 '18 at 16:33
  • Thank you. At the moment, for the PDF part, no, I am not writing a program; I am just looking for tools to check whether this is practicable. I am developing a program for the blockchain part, the smart contract (on-chain). This part is done off-chain, meaning outside of the blockchain. Thanks. Hashing **pixels** seems very interesting. I need to test it. – Questioner Sep 24 '18 at 17:24
  • In the case of hashing **pixels** instead of text, is the verifier of the checksum able to find the exact selected and hashed pixels? Are there any tools to do this, or do I need to implement it myself? (As you know, even a single wrong pixel causes checksum validation to fail.) Thank you. – Questioner Sep 25 '18 at 10:19