UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 2090: character maps to <undefined>

Question

This is the code I'm running to extract text from images and append it to an output file:

def main():
    path =r"D drive where images are stored"
    fullTempPath =r"D drive where extracted texts are stored in xls file"
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName) 
        img = Image.open(inputPath) 
        text = pytesseract.image_to_string(img, lang ="eng") 
        file1 = open(fullTempPath, "a+") 
        file1.write(imageName+"\n") 
        file1.write(text+"\n") 
        file1.close()  
    file2 = open(fullTempPath, 'r') 
    file2.close()   
if __name__ == '__main__':  
    main()

I'm getting the error below. Can someone help me with this?

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-7-fb69795bce29> in <module>
     13     file2.close()
     14 if __name__ == '__main__':
---> 15     main()

<ipython-input-7-fb69795bce29> in main()
      8         file1 = open(fullTempPath, "a+")
      9         file1.write(imageName+"\n")
---> 10         file1.write(text+"\n")
     11         file1.close()
     12     file2 = open(fullTempPath, 'r')

~\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 2090: character maps to <undefined>

[`open(fullTempPath, "a+", encoding="utf-8")`](https://utf8everywhere.org/)… — JosefZ, Jan 06 '21 at 12:41
I've yet to find a way to disable ligature output in pytesseract on this site, but https://stb-tester.com/blog/2014/04/14/improving-ocr-accuracy might have something. — user202729, Jan 09 '21 at 08:40
There's also [python - Convert hexadecimal character (ligature) to utf-8 character - Stack Overflow](https://stackoverflow.com/questions/9175073/convert-hexadecimal-character-ligature-to-utf-8-character), to patch the output of tesseract; although that might be less accurate? — user202729, Jan 09 '21 at 08:42

score 1 · Answer 1 · edited Nov 11 '21 at 06:10

text = 'unicode error on this text'
text = text.decode('utf-8')

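Try decoding the text. Note that `str.decode` exists only in Python 2; in Python 3, `str` has no `decode` method, so this call raises AttributeError there.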

edited Nov 11 '21 at 06:10

Samsul Islam

answered Jan 06 '21 at 07:08

Yusuf Tezcan

CryptoFool · Accepted Answer · 2021-01-06T13:14:24.233

1

I don't know why Tesseract would be returning a string containing an invalid Unicode character, but that appears to be what is going on. It is possible to tell Python to ignore encoding errors. Try changing the line that opens the output file to the following:

file1 = open(fullTempPath, "a+", errors="ignore")
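
To see what `errors="ignore"` does to the data, here is a small sketch that applies the same cp1252 codec named in the traceback through `str.encode`. The sample string is made up for illustration and is not taken from the asker's OCR output.

```python
# Hypothetical sample: "file" spelled with the U+FB01 ligature, as OCR might return it.
sample = "\ufb01le"

# errors="ignore" silently drops any character cp1252 cannot represent.
ignored = sample.encode("cp1252", errors="ignore")

# errors="replace" substitutes "?" so the loss is at least visible.
replaced = sample.encode("cp1252", errors="replace")

print(ignored)   # b'le'
print(replaced)  # b'?le'
```

Either way the original character is lost, so this only suits cases where dropping unencodable characters is acceptable.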

edited Jan 06 '21 at 13:14

answered Jan 06 '21 at 07:25

CryptoFool

Hi, Thank you for the reply. Getting an error "write() takes no keyword arguments" – GJR Jan 06 '21 at 07:44
Doh! I added that argument to the wrong line of your code. Sorry about that. I updated the answer to fix my boo boo. – CryptoFool Jan 06 '21 at 13:13
@Steve Invalid? Looks pretty valid, according to the [other answer](https://stackoverflow.com/a/65620323/5267751). – user202729 Jan 09 '21 at 08:38
@GayathriJayapandian You might want to use utf8 encoding as suggested in the other answer, otherwise the data in the file might be corrupted. – user202729 Jan 09 '21 at 08:38

Mark Tolonen · Answer 3 · 2021-01-07T21:46:03.670

The default file encoding used by `open` is the value returned by `locale.getpreferredencoding(False)`, which on Windows is generally a legacy encoding that doesn't support all Unicode characters. In this case the traceback shows it was cp1252 (a.k.a. Windows-1252). It is best to specify the encoding you want explicitly. UTF-8 handles all Unicode characters:

file1 = open(fullTempPath, "a+", encoding='utf8')
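
A minimal sketch of the difference, assuming a temporary file stands in for the asker's output path (the file name `ocr_output.txt` and the sample text are hypothetical). It reproduces the failure with the cp1252 codec and then writes the same text with an explicit UTF-8 encoding:

```python
import locale
import os
import tempfile

# Hypothetical OCR result containing U+FB01 (LATIN SMALL LIGATURE FI).
text = "\ufb01rst line of OCR text"

# The locale's preferred encoding; the value is platform dependent.
print(locale.getpreferredencoding(False))

# cp1252 has no mapping for U+FB01, so encoding fails as in the traceback.
try:
    text.encode("cp1252")
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

# With an explicit UTF-8 encoding the write succeeds regardless of locale.
out_path = os.path.join(tempfile.mkdtemp(), "ocr_output.txt")
with open(out_path, "a+", encoding="utf-8") as f:
    f.write(text + "\n")

# Read the file back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

Using `with` also closes the file if a write raises, which the explicit `close()` calls in the question do not guarantee.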

FYI, U+FB01 is LATIN SMALL LIGATURE FI (ﬁ) if that makes any sense on the image being processed.
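
If plain ASCII letters are preferred over ligatures in the output, one option (an assumption on my part, not something the asker requested) is to normalize the OCR text before writing it. The sketch below uses the standard `unicodedata` module; the sample word is hypothetical.

```python
import unicodedata

ligature = "\ufb01"

# Look up the official Unicode name of the character from the error message.
name = unicodedata.name(ligature)

# NFKC compatibility normalization expands the ligature into plain "f" + "i".
# Caveat: NFKC also rewrites other compatibility characters (e.g. superscripts).
normalized = unicodedata.normalize("NFKC", "con" + ligature + "g")

print(name)        # LATIN SMALL LIGATURE FI
print(normalized)  # config
```

Normalization changes the text, so it is worth checking whether that is acceptable for the documents being processed.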

Also, Windows editors tend to assume the same legacy encoding unless the file is written with the `utf-8-sig` encoding, which adds an encoded BOM character at the beginning of the file as a hint that the content is UTF-8.
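
A short sketch of what `utf-8-sig` does on disk, again assuming a hypothetical temporary file named `ocr_output.txt`:

```python
import codecs
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "ocr_output.txt")

# utf-8-sig writes the 3-byte BOM (EF BB BF) before the first character.
with open(out_path, "w", encoding="utf-8-sig") as f:
    f.write("\ufb01le\n")

# Inspect the raw bytes to confirm the BOM is present.
with open(out_path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(codecs.BOM_UTF8)

# Reading with utf-8-sig strips the BOM again.
with open(out_path, encoding="utf-8-sig") as f:
    content = f.read()

print(has_bom)  # True
```

Reading the same file with plain `utf-8` would leave the BOM in the text as a leading `\ufeff` character, so the reader should use `utf-8-sig` as well.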
