I have a few PDF files in Hebrew that I want to translate to English. When I copy and paste the text from the PDF files into a text editor, all of the Hebrew final letters are copied incorrectly.
I found this question, but it has no solution, and it only covers one specific final letter being misread by one specific library.
I tried copying and pasting from both Acrobat Reader and the Chrome PDF viewer, and both failed to copy the contents correctly.
Another interesting thing I found: when you press Ctrl+F in the browser (I tried it in Chrome) and search for the final letter "Pe", for example, it returns matches for both the regular "Pe" and the final "Pe" (and vice versa when you search for the regular "Pe"). This is odd, because the two letters have different code points (and different codes in the ANSI code page). The same happens for every final letter and its corresponding regular letter.
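To make the "different code points" point concrete, here is a small Python check of the regular and final "Pe". It only prints what the Unicode database and the `cp1255` codec report; it does not explain why the browser search treats the two letters as equal.

```python
import unicodedata

REGULAR_PE = "\u05e4"  # regular Pe
FINAL_PE = "\u05e3"    # final Pe

for ch in (REGULAR_PE, FINAL_PE):
    # Show the Unicode name, the code point, and the Windows-1255 byte.
    print(
        unicodedata.name(ch),
        f"U+{ord(ch):04X}",
        ch.encode("cp1255").hex().upper(),
    )
```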
So the question is: does anyone know why this happens?
I understand that there might be no actual code point mapped to the glyph, but in that case, how are the characters rendered correctly? I'm not very familiar with this subject, so I would appreciate any explanation. In addition, any good solution that lets me extract the text with the final letters intact would be greatly appreciated, since I want to parse the text and the broken letters leave me with incomplete words.
EDIT:
As requested by weibeld I'm adding a few copied words and the corresponding correct words.
I'll also add their hexdump.
E1 F7 F8 1B בקר. # Should be בקרן (final letter "Nun"). Every final Nun is replaced with 1B instead of EF, the final Nun byte in the Windows-1255 code page.
F2 F1 F7 E9 E9 17 עסקיי. # Should be עסקיים (final letter "Mem"). Every final Mem is replaced with 17 instead of ED.
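As a stopgap while the root cause is unknown, the two substitutions seen in the samples above can be reversed before decoding. This is only a sketch: `OBSERVED_FIXES` and `repair_cp1255` are names made up for illustration, the table covers only final Nun and final Mem (the bytes used for the other final letters are not known from these samples), and it assumes the pasted text was saved as Windows-1255 bytes.

```python
# Substitutions observed in the two samples only; final Kaf, Pe and Tsadi
# are NOT covered because their replacement bytes are unknown.
OBSERVED_FIXES = {
    0x1B: 0xEF,  # control byte seen in place of final Nun
    0x17: 0xED,  # control byte seen in place of final Mem
}

def repair_cp1255(raw: bytes) -> str:
    """Swap the observed control bytes back and decode as Windows-1255."""
    fixed = bytes(OBSERVED_FIXES.get(b, b) for b in raw)
    return fixed.decode("cp1255")

print(repair_cp1255(bytes.fromhex("E1F7F81B")))
print(repair_cp1255(bytes.fromhex("F2F1F7E9E917")))
```

Bytes that are not in the table pass through unchanged, so text that was copied correctly is not affected.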
Thanks!