0

So I have a few PDF files in Hebrew that I wanted to translate to English, and when I tried to copy and paste the text from the PDF files into a text editor, all of the Hebrew final letters were copied incorrectly.

I found this question, but no solution was given there; it also only dealt with one specific final letter that was read incorrectly, and it only referred to a specific library.

I tried copying and pasting from both Acrobat Reader and the Chrome PDF viewer, but both failed to copy the contents correctly.

Another interesting thing I found is that when you Ctrl+F in the browser (I tried it in Chrome) and search for, say, the final letter "Pe", it gives results for both the regular "Pe" and the final "Pe" (and vice versa when you search for the regular "Pe"), even though they have different code points (and different codes in the ANSI code page), which is also odd. (It's the same for all of the final letters and their corresponding regular letters.)
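For reference, the two forms really are distinct characters. A quick way to check their code points, assuming a UTF-8 terminal with bash and hexdump available:

printf 'פף' | hexdump -C    # regular Pe is U+05E4, final Pe is U+05E3 (UTF-8: D7 A4 vs D7 A3)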

So the question is: does anyone know why this happens?
I get that there might be no actual code point mapped to the glyph, but in that case how are the characters rendered? I'm not very familiar with this subject, so I would appreciate any explanation. In addition, any good solution that lets me extract the text with the final letters intact would be very much appreciated, since I would like to parse the text, and mangled letters result in incomplete words.

EDIT:
As requested by weibeld, I'm adding a few copied words and the corresponding correct words, together with their hex dumps.

E1 F7 F8 1B       בקר.     # Should be בקרן (final letter "Nun"). Every final Nun comes out as 1B instead of EF according to the Windows-1255 code page.

F2 F1 F7 E9 E9 17 עסקיי.   # Should be עסקיים (final letter "Mem"). Every final Mem comes out as 17 instead of ED.
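
A quick sanity check that EF and ED really are the final Nun and final Mem in Windows-1255 (a sketch, assuming bash and GNU iconv are available):

printf '\xEF\xED' | iconv -f WINDOWS-1255 -t UTF-8    # prints ןם (final Nun, final Mem)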

Thanks!

Daniel
  • What's the font encoding used by this PDF file? If you have Adobe Acrobat Reader, you can go to *File > Properties* and then click on the *Fonts* tab. – weibeld Jul 11 '17 at 17:12
  • @weibeld One of the encodings is Identity-H and the rest are either Standard or Custom. Could custom encodings be a problem? I have all of the fonts on my system. – Daniel Jul 12 '17 at 08:12
  • I think the answer by Patrick Gallot points in the right direction. It depends on the text extraction behaviour of this PDF file, i.e. which encoding this file uses for text extraction. Can you post some example words with the incorrect final letter and the corresponding correct final letter in your question? – weibeld Jul 14 '17 at 03:08
  • And can you run `echo "word" | hexdump` where `word` is an incorrect word as copied from the PDF file? – weibeld Jul 14 '17 at 03:13
  • @weibeld Added things in my edit. The problem I have with Patrick Gallot's answer is that I'm not sure if I can do all of this to my pdf files. From what I've seen I don't have permission to edit them so I assume I can't add anything to the font encodings? Correct me if I'm wrong because Patrick did not respond to my comment asking if this is possible. I would be very glad if it's possible obviously, and if there are any good libraries that might help me, do let me know! – Daniel Jul 14 '17 at 14:21
  • See if my answer is an option. – weibeld Jul 14 '17 at 18:35

2 Answers

1

So, based on your edit, the PDF file seems to use some strange (non-ASCII-compatible) Hebrew encoding for text extraction, which places the final forms of the letters in the 0x1X range, where ASCII has its non-printable control characters.

If all you want is to reconstruct the text in the PDF, the easiest solution might be not to change the PDF, but to replace the wrong codes with the correct ones after copying the text out of it.

For example, paste the text copied from the PDF into a file and then:

cat file | tr '\033' '\357' | tr '\027' '\355' >out_file

That is, one tr for each wrong final letter. The numbers 033, 357, etc. are just the octal forms of the hexadecimal bytes 1B, EF, etc. that you found with hexdump. Just find the remaining mappings and add them to the chain. Then out_file should contain the correctly encoded text, and you can open it in a text editor using Windows-1255.
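
If you would rather end up with UTF-8 directly (so that any editor or parser can read the file without choosing a code page), one possible variant is to pipe the result through iconv. This is only a sketch, assuming GNU tr and iconv, and that you add the remaining final-letter codes to the chain as you find them:

tr '\033' '\357' <file | tr '\027' '\355' | iconv -f WINDOWS-1255 -t UTF-8 >out_file.txt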

weibeld
  • Hey, thank you very much for your help by the way. This is more or less what I thought about doing but I was waiting to see if anyone knew why this happened in the first place (since seeing this happen got me curious). I can mark this as the answer if you don't think it's solvable by any other way. Thank you anyways :) – Daniel Jul 14 '17 at 19:30
  • I think it's just the easiest way to get past the problem. Certainly you could analyse and recreate the PDF, but you would probably need to become a PDF expert to do that. And the reason why this happened most likely has to do with a custom/improper implementation of the program that created the PDF, or with conversions that were applied to the PDF after creation; things that are hard to track down. So, feel free to accept the answer if it solves your most urgent problem. – weibeld Jul 15 '17 at 07:30
0

The PDF Reference is largely silent on the proper way to encode non-Latin, non-CJK text for text extraction (none of this is required for rendering glyphs), but there are essentially two ways to do so: the first is to have a ToUnicode table (for both simple and composite fonts); the second, for simple fonts, is to specify an encoding dictionary with a Differences array identifying each glyph by a name from the Adobe Glyph List (e.g. https://github.com/adobe-type-tools/agl-aglfn/blob/master/glyphlist.txt).

Identity-H encoding implies a composite (double-byte) font, which might have a ToUnicode table. A custom encoding implies an encoding dictionary with a Differences array. Standard encoding implies that no predefined (or custom) encoding was specified.

The mix of all three together implies a very muddled origin.
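
If poppler-utils happens to be available, one way to see this mix without editing the file is pdffonts; its "encoding" column lists Identity-H / Custom / Standard per font, and the "uni" column shows whether a font carries a ToUnicode map (the file name below is just a placeholder):

pdffonts file.pdf    # check the "encoding" and "uni" columns for each font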

Patrick Gallot
  • Is it possible for me to do this if I can't edit the PDF, though? (Nor change the fonts.) I am not quite familiar with this subject, so I probably did not completely understand everything you said in your answer. – Daniel Jul 12 '17 at 20:31
  • When it concerns text extraction, you should also take into account "actualText". Content in a pdf document can be marked with a property called "actualText". It also influences copy/paste behaviour. – Joris Schellekens Jul 13 '17 at 11:39
  • I'm not aware of a good after-the-fact solution to the problem. OCR might be easiest. – Patrick Gallot Jul 14 '17 at 14:36