Problem displaying content pasted from a Persian PDF

Question

I have a PDF in Persian, but when I copy its contents, the pasted text shows up as nonsense characters (except for numbers). For example, I copied this number from my PDF and it pasted correctly: 371960012100240806356111 => '371960012100240806356111'.

But when I copy something like the name گلچین فر, it is pasted as £3⁄4ÉuÅ{

How can I fix this problem? I want to extract the contents with Python. The extraction itself works, but the names are not displayed correctly.

The sample PDF file is here: https://ufile.io/qibejys1

Thanks

Apparently your PDF does not contain the information required for text extraction. Ask the distributor of the PDF for a version that does. — mkl, Jul 19 '21 at 16:11
If you provide a link to your problematic PDF, we could help you to investigate — Kfcaio, Jul 19 '21 at 18:34
As assumed above, the PDF does not contain the information required for text *extraction* (**ToUnicode** tables or self-explanatory **Encodings**), so that won't work. An alternative approach would be **optical recognition** of the text as proposed by @Kfcaio's answer. — mkl, Jul 20 '21 at 17:17
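
Since the question mentions extracting with Python, here is a minimal sketch of how one might detect this situation in already-extracted text before falling back to OCR. It is only a heuristic and not part of the comments above: the function names and the 0.5 threshold are assumptions, and it simply checks whether the letters in a string fall in the Arabic Unicode block (U+0600 to U+06FF), which covers Persian.

```python
def arabic_script_ratio(text):
    """Return the fraction of letters in the Arabic block, or None if there are no letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return None  # e.g. a string of digits only: nothing to judge
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def looks_garbled(text, threshold=0.5):
    """Heuristic (assumed threshold): True if the text has letters but few are Persian/Arabic."""
    ratio = arabic_script_ratio(text)
    return ratio is not None and ratio < threshold


if __name__ == "__main__":
    for sample in ["گلچین فر", "£3⁄4ÉuÅ{", "371960012100240806356111"]:
        print(repr(sample), "garbled?", looks_garbled(sample))
```

Text that is flagged this way probably comes from fonts without usable **ToUnicode** or **Encoding** information, so the page would need optical recognition instead.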

Kfcaio · Accepted Answer · 2021-07-20T17:56:38.053

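You may want to try the following steps:

1. Install Tesseract 4 or higher (check the official tutorial).
2. Get the Persian-specific model (fas.traineddata) and copy it to your local tessdata folder.
3. Convert the problematic PDF pages to TIFF. Split the pages first (take a look at the pdftk tool); on Ubuntu, use the ImageMagick convert command for the conversion.
4. Run something like `tesseract image.tiff text -l fas`. Tesseract appends .txt to the output name, so the result lands in text.txt.
5. Tweak your command with options such as `--psm` (page segmentation mode).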
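
The steps above could be driven from Python roughly as follows. This is a sketch under assumptions, not a tested recipe for the sample file: it assumes the ImageMagick `convert` and `tesseract` executables are on the PATH and that the `fas` model is installed. The helper names, the 300 DPI density, and `--psm 6` are illustrative choices, and it selects the page with ImageMagick's `file.pdf[index]` syntax instead of splitting with pdftk.

```python
import subprocess
from pathlib import Path


def convert_cmd(pdf, page_index, tiff, dpi=300):
    """Build an ImageMagick command that rasterizes one PDF page to TIFF."""
    # "file.pdf[0]" selects the first page; -density sets the rendering resolution.
    return ["convert", "-density", str(dpi), f"{pdf}[{page_index}]", str(tiff)]


def tesseract_cmd(image, out_base, lang="fas", psm=6):
    """Build a Tesseract command; Tesseract itself appends '.txt' to out_base."""
    return ["tesseract", str(image), str(out_base), "-l", lang, "--psm", str(psm)]


def ocr_page(pdf, page_index=0, workdir="."):
    """Rasterize one page, OCR it, and return the recognized text."""
    workdir = Path(workdir)
    tiff = workdir / f"page{page_index}.tiff"
    out_base = workdir / f"page{page_index}"
    subprocess.run(convert_cmd(pdf, page_index, tiff), check=True)
    subprocess.run(tesseract_cmd(tiff, out_base), check=True)
    return out_base.with_suffix(".txt").read_text(encoding="utf-8")
```

For example, `print(ocr_page("sample.pdf"))` would OCR the first page of a hypothetical sample.pdf in the current directory. Recognition quality depends heavily on the resolution and the page segmentation mode, so both are worth adjusting.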

edited Jul 20 '21 at 17:56

answered Jul 20 '21 at 11:08

Kfcaio

I think you mean the [Persian-specific model](https://github.com/tesseract-ocr/tessdata/blob/master/fas.traineddata). – lenz Jul 20 '21 at 16:19
