I have extracted some Arabic text from PDFs; the PDFs themselves display the text correctly. However, the extraction evidently uses an incorrect encoding, so the text comes out as a jumble of weird characters. I tried UTF-8, UCS-2, ANSI, Windows-1256, OEM 720, and ISO 8859-6 (ISO Arabic), but none of them is correct.
The problem persists regardless of the extraction technique. The information is there: the (few) Latin characters come through correctly, as do numbers, spaces, etc.; only the Arabic characters appear as odd Roman characters instead.
I could now manually map each wrong character onto the correct one, but there has got to be a better way. Is there a way to run the text through every encoding that can represent Arabic and find out which one is the correct one?
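Here is the kind of brute force I have in mind (just a sketch; the codec lists are guesses on my part, and it assumes the extractor mis-read the original bytes with some single-byte codec):

    # Re-encode the garbled text under each guess at the wrong codec,
    # then try decoding with every Arabic-capable codec and eyeball
    # the candidates for readable Arabic.
    sample = "ÊUÐUý f�√ ¡U�� bNA²Ý«"  # snippet of the garbled output

    wrong_decodes = ["latin-1", "cp1252", "mac-roman"]  # assumed mis-reads
    arabic_codecs = ["cp1256", "iso-8859-6", "cp720", "cp864"]

    for wrong in wrong_decodes:
        raw = sample.encode(wrong, errors="replace")  # plausible raw bytes
        for right in arabic_codecs:
            try:
                print(f"{wrong} -> {right}: {raw.decode(right)}")
            except UnicodeDecodeError:
                pass  # this pairing cannot be the answer

The idea is that the correct pairing should be obvious by eye, since wrong ones tend to produce more mojibake or fail to decode at all.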
For instance, this is what I get when I try UTF-8:
ÊUOMOD�K� ÊUÐUý f�√ ¡U�� bNA²Ý« ≠ ÍœuLÝ wKŽË w$d�ô« œULŽ ≠ 5Mł ≠ …ež WOMOD�K� ‰“UM� vKŽ WOKOz«dÝô« WOF�bLK� nB� w� Õ«d−Ð ÊËdš¬ WFЗ√ VO�«Ë Æ…ež ŸUD� w�ULý UO¼ô XOÐ …bKÐ w� ≠≤∂ ’ WOI³�«≠
I am using Python, so if there is a Python-based solution, that would be great. But anything that tells me which encoding I should use is welcome.
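Would a ready-made detector like the third-party chardet package work here? A sketch of what I mean (assuming I can recover the raw bytes by re-encoding with latin-1, which may be the wrong assumption):

    # pip install chardet -- a statistical encoding detector.
    # It works on bytes, not on the already-mangled str.
    import chardet

    garbled = "ÊUOMOD�K� ÊUÐUý f�√ ¡U�� bNA²Ý«"
    raw = garbled.encode("latin-1", errors="replace")  # assumed mis-read

    guess = chardet.detect(raw)  # e.g. {'encoding': ..., 'confidence': ...}
    print(guess["encoding"], guess["confidence"])

I am not sure how reliable such a guess would be on a short sample, though.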
Thanks a lot!