Strange characters appear when extracting Arabic text from a PDF (PdfToText)

Question

I have a problem extracting Arabic text from a PDF.
I use the PdfToText library.
The text comes out like this: (΋ΎϬϧϟ΍υϔΣϟ΍ΦϳέΎΗ ΏϟΎρϟ΍ϡϳΩϘΗΝΫϭϣϧ ΩϳϘϟ΍ϡϗέ). How can I solve it? I tried

<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />

but this did not solve my problem.

I'm not sure, but it could be an encoding issue. Have you tested it with a plain English PDF? – James, Mar 19 '18 at 16:10
Yes, I tested it with an English PDF and it works well, but it does not work with an Arabic PDF. – Ahmed Mahmoud, Mar 19 '18 at 16:12
I've never used it, and I'm not sure whether it's open source or can be downloaded once registered, but perhaps this could help: http://arabicpdf.com/PdfDebugger/ – James, Mar 19 '18 at 16:15

ino · Answer 1 · 2018-03-19T16:25:13.280

English letters are part of the basic ASCII character set, so the output usually has no problems. Other languages that use accented letters or entirely different alphabets, e.g. Arabic, Cyrillic, or Greek, use characters outside that basic set.

Make sure all three sources are using the same encoding:

1. all the PHP scripts generating the output
2. the HTML encoding meta tag
3. the output file as well
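
To illustrate why these places have to agree, here is a minimal Python sketch (Python is used only for illustration; the thread itself is about PHP). It encodes an Arabic word as UTF-8 and then decodes the same bytes with a different, wrongly assumed encoding. The sample word, the helper name `roundtrip`, and the choice of Latin-1 as the mismatched encoding are assumptions made for the demonstration. The thread does not establish that such a mismatch is what causes the asker's output.

```python
# Illustrative only: what happens when text written as UTF-8
# is read back with a different encoding.

def roundtrip(text: str, write_enc: str, read_enc: str) -> str:
    """Encode text with write_enc, then decode the bytes with read_enc."""
    return text.encode(write_enc).decode(read_enc)

# Assumed sample: the Arabic word for "date", written with escapes.
sample = "\u062a\u0627\u0631\u064a\u062e"

matched = roundtrip(sample, "utf-8", "utf-8")
mismatched = roundtrip(sample, "utf-8", "latin-1")

print(matched == sample)             # True: both sides agree on UTF-8
print(mismatched == sample)          # False: the reader assumed Latin-1
print(len(sample), len(mismatched))  # 5 10: each letter became two characters
```

When every stage agrees on UTF-8 the text survives unchanged; as soon as one stage assumes something else, each two-byte Arabic letter is shown as two unrelated characters.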

ad 1 Check how your editor saves the PHP scripts to the file system. How to configure this differs from editor to editor.
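
If the editor does not show the encoding clearly, the saved bytes can be inspected directly. The following is a rough sketch in Python, not part of PdfToText or of any editor; the helper names `describe_encoding` and `describe_file` and the sample PHP line are hypothetical. It reports whether the bytes decode as UTF-8 and whether they begin with a UTF-8 byte order mark.

```python
import codecs

def describe_encoding(data: bytes) -> dict:
    """Report whether raw bytes are valid UTF-8 and whether they start with a UTF-8 BOM."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {"valid_utf8": valid, "has_bom": has_bom}

def describe_file(path: str) -> dict:
    """Same check for a file on disk, read in binary mode."""
    with open(path, "rb") as fh:
        return describe_encoding(fh.read())

# Example with in-memory bytes instead of a real script file.
source = "<?php echo '\u0645\u0631\u062d\u0628\u0627';"  # assumed sample line
utf8_bytes = source.encode("utf-8")
cp1256_bytes = source.encode("cp1256")  # a legacy Arabic code page

print(describe_encoding(utf8_bytes))
print(describe_encoding(cp1256_bytes))
```

A file containing only ASCII characters is also valid UTF-8, so this check is informative only for files that actually contain non-ASCII text. A negative result says the file is not UTF-8 but does not say which encoding it is.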

ad 2 Use the HTML meta tag <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

ad 3 Set the output encoding to UTF-8. With the pdftotext command-line tool, for example: pdftotext -enc UTF-8 your.pdf. According to its documentation, the PdfToText class already generates UTF-8-encoded text.
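
As a cross-check, the same PDF can be run through the pdftotext command-line tool, which is a separate program from the PdfToText PHP class. The sketch below is in Python and assumes that pdftotext (from Xpdf or Poppler) is installed and on the PATH; the function names `build_pdftotext_command` and `extract_text` are hypothetical. It passes `-enc UTF-8` and uses `-` as the output file so that the text goes to standard output.

```python
import shutil
import subprocess

def build_pdftotext_command(pdf_path: str) -> list:
    """Build the argument list for: pdftotext -enc UTF-8 <pdf> -"""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]

def extract_text(pdf_path: str) -> str:
    """Run pdftotext and decode its output strictly as UTF-8."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        check=True,           # raise if pdftotext exits with an error
        capture_output=True,  # collect stdout as bytes
    )
    return result.stdout.decode("utf-8")

print(build_pdftotext_command("your.pdf"))
```

If the command-line tool gives readable Arabic for the same file, the encoding settings are probably fine and the difference lies in how the PDF is being parsed; if it gives similar garbage, the PDF itself may not carry usable character information.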

edited Mar 19 '18 at 16:25

answered Mar 19 '18 at 16:15


How can I set the encoding to UTF-8? – Ahmed Mahmoud Mar 19 '18 at 16:43
I have mentioned 3 places where it should be set. Which one is your question related to? – ino Mar 19 '18 at 17:02
The official documentation of PdfToText says that its class generates UTF8-encoded text. – ino Mar 19 '18 at 17:53
So I would focus on the first item and make sure your IDE is saving the PHP scripts in UTF-8. Which editor do you use for your PHP scripts? Is it saving your files in the proper encoding? – ino Mar 19 '18 at 17:54
and the encoding? – ino Mar 19 '18 at 17:58
Is there anything else I can use? – Ahmed Mahmoud Mar 19 '18 at 17:58
How can I find out the encoding? – Ahmed Mahmoud Mar 19 '18 at 17:58
https://stackoverflow.com/questions/21289157/set-encoding-of-file-to-utf8-with-bom-in-sublime-text-3 – ino Mar 19 '18 at 17:59
The same problem. – Ahmed Mahmoud Mar 19 '18 at 18:02
Are you sure the PdfToText library supports the Arabic alphabet? – ino Mar 19 '18 at 18:48
Do you have something else? – Ahmed Mahmoud Mar 20 '18 at 11:23
