Config for pytesseract (Urdu language)

Asked Aug 08 '21 at 21:03

Active Sep 10 '22 at 16:39

Viewed 437 times

I am having trouble with pytesseract: the following line of code gives poor results on Urdu text:

text = pytesseract.image_to_string(img, lang="urd")

What configuration should I use to improve accuracy for Urdu, and what kind of pre-processing should I apply to the image?
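To make the question concrete, here is a minimal sketch of the kind of configuration and pre-processing meant above. It is not a known-good recipe for Nastaleeq fonts. The helper names `build_tesseract_config` and `ocr_urdu_line` are hypothetical, OpenCV (`cv2`) is assumed to be installed alongside pytesseract and the `urd` traineddata, and the choices of `--psm 7` (treat the image as a single text line), `--oem 1` (LSTM engine only), 2x upscaling, and Otsu binarization are assumptions that would need to be tested against the actual images.

```python
def build_tesseract_config(psm=7, oem=1):
    """Build the string passed to pytesseract's `config` argument.

    psm 7 = treat the image as a single text line (assumed to fit the sample image),
    oem 1 = use only the LSTM engine. Both defaults are assumptions to tune.
    """
    return f"--oem {oem} --psm {psm}"


def ocr_urdu_line(image_path, psm=7, scale=2.0):
    """Hypothetical helper: basic clean-up followed by OCR with lang='urd'."""
    # Imported here so the config helper above works without the OCR stack.
    import cv2
    import pytesseract

    # Load as grayscale; cv2.imread returns None if the file cannot be read.
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(image_path)

    # Upscale small text, since low-resolution glyphs are often misread.
    img = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

    # Otsu's method picks a global threshold to binarize the image.
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return pytesseract.image_to_string(
        img, lang="urd", config=build_tesseract_config(psm=psm)
    )
```

A call would look like `text = ocr_urdu_line("test.png")`, where `test.png` is a placeholder file name. For images containing several lines, `--psm 6` (a single uniform block of text) may be the more appropriate assumption.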

I am using this kind of image: TestFile

For the attached image, the output should be:

بعد نجی ٹی وی سے گفتگو کرتے ہوئے وزیر خارجہ شاہ محمود قریشی نے بتایا کہ ملاقات

But the output I am getting is:

٦ری‏ وی سے کلوکرتے ہونے وز خارمہ اہ مود رٹ نے نال لات

Images are in these fonts: Pak Nastaleeq, Alvi Nastaleeq, Jameel Noori Nastaleeq, Nafees Nastaleeq.

edited Aug 09 '21 at 17:08

Samee Arif

Can you please provide the desired output as (copy-pasted) Unicode characters? I'd like to get an impression of how the written text differs visually from the common representation, as in the [Urdu alphabet Wikipedia article](https://en.wikipedia.org/wiki/Urdu_alphabet). Is it handwriting or (computer-)typed text? – HansHirse Aug 09 '21 at 10:41
@HansHirse Thank you for your response. I have edited my question and copy-pasted Urdu characters. – Samee Arif Aug 09 '21 at 12:28

0 Answers