I have a PDF in Persian, but when I copy its contents, the text is pasted as nonsense characters (numbers are the exception). For example, I copied a number from my PDF and pasted it here: 371960012100240806356111 => '371960012100240806356111'. The number is pasted correctly.


But when I copy something like a name, e.g. گلچین فر, it is pasted as £3⁄4ÉuÅ{

How can I fix this problem? I want to extract the contents with Python. The extraction itself runs, but the names are not displayed correctly.

A sample PDF file is here: https://ufile.io/qibejys1

thanks

  • Apparently your PDF does not contain the information required for text extraction. Ask the distributor of the PDF for a version that does. – mkl Jul 19 '21 at 16:11
  • If you provide a link to your problematic PDF, we could help you to investigate – Kfcaio Jul 19 '21 at 18:34
  • @Kfcaio I uploaded a sample file – mehdi seifabadi Jul 20 '21 at 04:04
  • As assumed above, the PDF does not contain the information required for text *extraction* (**ToUnicode** tables or self-explanatory **Encodings**), so that won't work. An alternative approach would be **optical recognition** of the text as proposed by @Kfcaio's answer. – mkl Jul 20 '21 at 17:17

1 Answer

You may want to try OCR with the following steps:

  1. Install Tesseract 4 or higher (see the official installation tutorial).
  2. Get the Persian-specific model (fas.traineddata) and copy it to your local tessdata folder.
  3. Split the problematic PDF into pages (take a look at the pdftk tool) and convert them to TIFF (on Ubuntu, use ImageMagick's convert command).
  4. Run something like tesseract image.tiff text -l fas, which writes the recognized text to text.txt (Tesseract appends .txt to the output base name).
  5. Tweak the command with options such as --psm (page segmentation mode).
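
Since the goal is to do the extraction from Python, the steps above could be wrapped roughly as follows. This is only a minimal sketch, not a tested pipeline: it assumes the `convert` (ImageMagick) and `tesseract` executables are on your PATH and that `fas.traineddata` is installed. The function names, the 300 DPI value, and the default `--psm 6` are illustrative choices of mine, not something taken from your setup.

```python
import shutil
import subprocess
from pathlib import Path


def build_rasterize_command(pdf_path, tiff_path, dpi=300):
    """Build the ImageMagick command that renders a PDF to a TIFF image.

    The 300 DPI default is an assumption; OCR quality usually depends on it.
    """
    return ["convert", "-density", str(dpi), str(pdf_path), str(tiff_path)]


def build_tesseract_command(image_path, output_base, lang="fas", psm=6):
    """Build the Tesseract 4+ command; the result goes to <output_base>.txt."""
    return [
        "tesseract", str(image_path), str(output_base),
        "-l", lang,           # language model, here Persian
        "--psm", str(psm),    # page segmentation mode (value is a guess)
    ]


def ocr_pdf(pdf_path, work_dir, lang="fas", psm=6):
    """Rasterize a PDF, run OCR on it, and return the recognized text."""
    # Fail early with a clear message if an external tool is missing.
    for tool in ("convert", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} was not found on PATH")

    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    tiff_path = work_dir / "page.tiff"
    output_base = work_dir / "text"

    subprocess.run(build_rasterize_command(pdf_path, tiff_path), check=True)
    subprocess.run(
        build_tesseract_command(tiff_path, output_base, lang, psm), check=True
    )

    # Tesseract appends ".txt" to the output base name.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

You would call it with something like `print(ocr_pdf("sample.pdf", "ocr_out"))`, where both names are placeholders. The commands are built in separate functions so you can print and inspect them before running anything.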
Kfcaio
  • I think you mean the [Persian-specific model](https://github.com/tesseract-ocr/tessdata/blob/master/fas.traineddata). – lenz Jul 20 '21 at 16:19