
I have extracted some Arabic text from PDFs. The PDFs display the text correctly, but the extraction evidently uses an incorrect encoding, so the text comes out as lots of weird characters. I tried UTF-8, UCS-2, ANSI, Windows-1256, OEM 720 and ISO Arabic, but none of these is correct.

The problem persists regardless of the extraction technique. The information is there: the (few) Latin characters are rendered correctly, as are numbers, spaces, etc. Only the Arabic characters appear as special Roman characters instead.

I could manually map each wrong character to the correct one, but there has got to be a better way. Is there a way to try other encodings that can represent Arabic characters and find out which one is correct?

What I get is this, for instance, in utf-8:

ÊUOMOD�K� ÊUÐUý f�√ ¡U�� bNA²Ý« ≠ ÍœuLÝ wKŽË w$d�ô« œULŽ ≠ 5Mł ≠ …ež WOMOD�K� ‰“UM� vKŽ WOKOz«dÝô« WOF�bLK� nB� w� Õ«d−Ð ÊËdš¬ WFЗ√ VO�«Ë Æ…ež ŸUD� w�ULý UO¼ô XOÐ …bKÐ w� ≠≤∂ ’ WOI³�«≠

I am using Python, so a Python solution would be great. But anything that tells me which encoding I should use is welcome.
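One way to try many encodings systematically is sketched below. It assumes the garbled text is classic mojibake, i.e. Arabic bytes that were decoded with the wrong single-byte codec, which may not hold for these PDFs. The codec lists and the helper names `arabic_ratio` and `rank_encodings` are illustrative choices, not part of any library. Each candidate pair re-encodes the garbled string to bytes, decodes those bytes with an Arabic-capable codec, and is scored by the share of characters that land in the Unicode Arabic block.

```python
import itertools

# Assumed candidates: codecs that may have produced the garbled characters,
# and codecs that can represent Arabic. Extend both lists as needed.
BYTE_CODECS = ["latin-1", "cp1252", "mac_roman", "cp437"]
ARABIC_CODECS = ["cp1256", "iso8859_6", "cp720", "cp864", "mac_arabic", "utf-8"]


def arabic_ratio(text):
    """Return the fraction of non-space characters in the Arabic block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    arabic = sum(1 for ch in chars if "\u0600" <= ch <= "\u06ff")
    return arabic / len(chars)


def rank_encodings(garbled):
    """Try every (byte codec, Arabic codec) pair and sort by Arabic ratio."""
    results = []
    for src, dst in itertools.product(BYTE_CODECS, ARABIC_CODECS):
        try:
            candidate = garbled.encode(src).decode(dst)
        except (UnicodeEncodeError, UnicodeDecodeError, LookupError):
            continue  # this pair cannot represent the text at all
        results.append((arabic_ratio(candidate), src, dst, candidate))
    results.sort(key=lambda item: item[0], reverse=True)
    return results


if __name__ == "__main__":
    # Synthetic mojibake for demonstration: Arabic encoded as cp1256,
    # then wrongly decoded as latin-1.
    sample = "مرحبا".encode("cp1256").decode("latin-1")
    for score, src, dst, text in rank_encodings(sample)[:3]:
        print(f"{score:.2f}  {src} -> {dst}  {text}")
```

A high score only means the output consists of Arabic letters, not that it is readable, so the top candidates still need a visual check. If no pair yields readable text, the PDF fonts may use a custom glyph encoding that no standard codec reverses.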

Thanks a lot!

C Baden
  • http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file – Joran Beasley Dec 22 '15 at 23:25
  • If you ever find a solution, please post it here. Extracting Arabic text from PDFs is, to the best of my knowledge, a known unsolved problem, regardless of which PDF library is used. I only tried open-source libraries, and I haven't heard any success stories from those who tried proprietary ones. – Not Important Dec 22 '15 at 23:50
  • If this PDF is public, share the link so we can test it. – Assem Dec 23 '15 at 10:28
  • I found an example. This one, for instance, I have trouble decoding correctly: http://www.alquds.com/old/pdfs/pdf/20060101/1/ – C Baden Jan 07 '16 at 16:50
  • I created a manual mapping. Seems to be no other way. – C Baden Jan 15 '16 at 23:18
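
The manual mapping mentioned in the last comment was not posted. A minimal sketch of how such a mapping could be applied in Python follows. The entries in `GLYPH_MAP` are placeholders invented for illustration, not the real correspondences for these PDFs, and the names `GLYPH_MAP`, `apply_mapping` and `unmapped_chars` are hypothetical. Only non-ASCII characters are mapped here, since the Latin letters and digits are reported to come through correctly.

```python
# Placeholder pairs only: the real table has to be built by comparing the
# extracted characters with the text shown in the PDF.
GLYPH_MAP = {
    "Ê": "\u0641",  # placeholder: garbled char -> ARABIC LETTER FEH
    "Ð": "\u0628",  # placeholder: garbled char -> ARABIC LETTER BEH
    "ý": "\u0634",  # placeholder: garbled char -> ARABIC LETTER SHEEN
}


def apply_mapping(garbled, mapping=GLYPH_MAP):
    """Replace every mapped character; leave all others untouched."""
    return garbled.translate(str.maketrans(mapping))


def unmapped_chars(garbled, mapping=GLYPH_MAP):
    """List non-ASCII characters that still lack a mapping entry."""
    return sorted({ch for ch in garbled if ord(ch) > 127 and ch not in mapping})


if __name__ == "__main__":
    text = "ÊÐý 2006 ž"
    print(apply_mapping(text))
    print(unmapped_chars(text))
```

Running `unmapped_chars` over the whole corpus shows which characters still need an entry, which helps when building the table incrementally.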
