Arabic PDF text extraction

Question

I'm trying to extract text from Arabic PDFs (raw text extraction, not OCR).

I tried many packages and tools (Python packages, PDFBox, the Adobe API, and many others), and all of them failed to extract the text correctly: either they read the text left to right (LTR) or they decode it incorrectly.

Here are two samples from different tools.
Sample 1:

املحتويات

7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧

Sample 2:

ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧

Original text (and yes, I can copy it and get the same text as rendered).

Is there any tool that can extract Arabic text correctly?

The book link can be found here.

K J · Accepted Answer · 2022-10-03T21:39:20.027

1

The text stored in a PDF is not the same as the text used to construct it. We can see that in your example, where page 7 is shown in Arabic on the surface but is coded as 7 in the plain text.

However, a greater problem is which languages the fonts support. In Notepad I had to accept a script font to see anything similar, and that relies on font substitution.

Another complication is Unicode and whitespace ordering.

So the result from

pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt

will at best look like

Thus, in summary, your Sample 1 is equal to, if not better than, any other simple attempt.

Later edit, from B.A.'s comment below:

I found a way around this: after extracting the text, I open the txt file and normalize its content using the unicodedata Python module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
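
The workflow described in that comment could be sketched in Python roughly as follows. This is a minimal sketch, not B.A.'s actual script: the helper names are hypothetical, the NFKC form is an assumption (the comment does not say which form was used), and it assumes the `pdftotext` command-line tool is installed and on the PATH.

```python
import subprocess
import unicodedata
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, first_page=None, last_page=None):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    return cmd + [str(pdf_path), str(txt_path)]


def normalize_text_file(txt_path, form="NFKC"):
    """Rewrite a UTF-8 text file in place using the given normalization form."""
    path = Path(txt_path)
    raw = path.read_text(encoding="utf-8")
    path.write_text(unicodedata.normalize(form, raw), encoding="utf-8")


def extract_and_normalize(pdf_path, txt_path, first_page=None, last_page=None):
    """Run pdftotext, then normalize the text file it produced."""
    subprocess.run(
        build_pdftotext_command(pdf_path, txt_path, first_page, last_page),
        check=True,  # raise if pdftotext exits with an error
    )
    normalize_text_file(txt_path)
```

Calling `extract_and_normalize("في_الأدب_الجاهلي.pdf", "try.txt", first_page=5, last_page=5)` would mirror the command shown earlier in this answer and then normalize `try.txt` in place.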

edited Oct 03 '22 at 21:39

answered Jun 09 '22 at 14:59

K J


actually pdftotext is working perfectly in this pdf, this is what I got `‫اﳌﺤﺘﻮﻳﺎت‬ ‫‪7‬‬ ‫ﻣﻘﺪﻣﺔ اﻟﻄﺒﻌﺔ اﻟﺜﺎﻧﻴﺔ‬ ‫‪9‬‬ ‫‪ -١‬اﻷدب وﺗﺎرﻳﺨﻪ‬ ‫‪51‬‬ ‫‪ -٢‬اﻟﺠﺎﻫﻠﻴﻮن‬ ‫‪95‬‬ ‫‪ -٣‬أﺳﺒﺎب ﻧﺤﻞ اﻟﺸﻌﺮ‬ ‫‪149‬‬ ‫‪ -٤‬اﻟﺸﻌﺮ واﻟﺸﻌﺮاء‬ ‫‪213‬‬ ‫‪ -٥‬ﺷﻌﺮ ﻣﴬ‬ ‫‪271‬‬ ‫‪ -٦‬اﻟﺸﻌﺮ‬ ‫‪285‬‬ ‫‪ -٧‬اﻟﻨﺜﺮ اﻟﺠﺎﻫﲇ‬ ` it decode stream correctly and produces the right sequence order. Thanks for suggesting it. – B.A Jun 12 '22 at 13:35
However, the only problem I found is that in some documents it chooses the wrong representation of a character. To illustrate: in Arabic, the character م has 4 different representations (مـ , ـمـ , ـم , م), depending on its position in a word. So instead of outputting "مواهبك", pdftotext will output ـمواهبك. Do you have any idea why, or how to solve that? – B.A Jun 12 '22 at 13:37
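
One way to check whether extracted text contains such position-specific characters is to look at each character's Unicode decomposition. The sketch below is only a diagnostic under assumptions: `find_positional_forms` is a hypothetical helper, and it reports the characters without explaining why a given extractor emitted them.

```python
import unicodedata

# Compatibility tags that Unicode uses for position-specific presentation forms.
POSITIONAL_TAGS = ("<isolated>", "<initial>", "<medial>", "<final>")


def find_positional_forms(text):
    """Return (character, Unicode name, base code points) for each presentation form."""
    hits = []
    for ch in text:
        decomposition = unicodedata.decomposition(ch)  # e.g. "<initial> 0645"
        if decomposition.startswith(POSITIONAL_TAGS):
            _tag, *base = decomposition.split()
            hits.append((ch, unicodedata.name(ch), base))
    return hits


# U+FEE3 is the initial form of meem; U+0648 is an ordinary waw.
for ch, name, base in find_positional_forms("\uFEE3\u0648"):
    print(f"U+{ord(ch):04X} {name} -> {base}")
```

Only the first character is reported here, because the ordinary waw has no positional decomposition.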

OK, I found a way around this: after extracting the text, I open the txt file and normalize its content using Python's [unicodedata](https://docs.python.org/3/library/unicodedata.html) module, which offers the `unicodedata.normalize()` function. So I can now say that pdftotext is the best tool for Arabic text extraction. – B.A Jul 13 '22 at 11:13

score 0 · Answer 2 · answered Oct 03 '22 at 08:58


Unicode normalization should fix that issue (you can choose the NFKC form).

Most programming languages have a normalization function. See here for more information about normalization: https://unicode.org/reports/tr15/
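
As a small illustration in Python, the following sketch compares NFC and NFKC on a word written with Arabic presentation forms. The exact code points are an illustrative assumption, modelled on the kind of output quoted in the comments on the other answer.

```python
import unicodedata

# "الشعر" written as a plain alef followed by presentation forms of
# lam (initial), sheen (medial), ain (medial) and reh (final).
extracted = "\u0627\uFEDF\uFEB8\uFECC\uFEAE"

nfc = unicodedata.normalize("NFC", extracted)
nfkc = unicodedata.normalize("NFKC", extracted)

# NFC leaves the presentation forms as they are.
print(nfc == extracted)

# NFKC replaces them with the base letters alef, lam, sheen, ain, reh.
print(nfkc == "\u0627\u0644\u0634\u0639\u0631")
```

Normalization only replaces characters; it does not reorder them, so output that was extracted in visual (reversed) order stays in that order after NFKC.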

answered Oct 03 '22 at 08:58

Ahmed Ayman
