
This is the code I'm trying to run to extract text from images and save it to a file at a given path.

def main():
    path =r"D drive where images are stored"
    fullTempPath =r"D drive where extracted texts are stored in xls file"
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName) 
        img = Image.open(inputPath) 
        text = pytesseract.image_to_string(img, lang ="eng") 
        file1 = open(fullTempPath, "a+") 
        file1.write(imageName+"\n") 
        file1.write(text+"\n") 
        file1.close()  
    file2 = open(fullTempPath, 'r') 
    file2.close()   
if __name__ == '__main__':  
    main() 

I'm getting the error below. Can someone help me with this?

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-7-fb69795bce29> in <module>
     13     file2.close()
     14 if __name__ == '__main__':
---> 15     main()

<ipython-input-7-fb69795bce29> in main()
      8         file1 = open(fullTempPath, "a+")
      9         file1.write(imageName+"\n")
---> 10         file1.write(text+"\n")
     11         file1.close()
     12     file2 = open(fullTempPath, 'r')

~\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 2090: character maps to <undefined>
GJR
  • [`open(fullTempPath, "a+", encoding="utf-8")`](https://utf8everywhere.org/)… – JosefZ Jan 06 '21 at 12:41
  • I've yet to find a way to disable ligature output in pytesseract on this site, but https://stb-tester.com/blog/2014/04/14/improving-ocr-accuracy might have something. – user202729 Jan 09 '21 at 08:40
  • There's also [python - Convert hexadecimal character (ligature) to utf-8 character - Stack Overflow](https://stackoverflow.com/questions/9175073/convert-hexadecimal-character-ligature-to-utf-8-character), to patch the output of tesseract; although that might be less accurate? – user202729 Jan 09 '21 at 08:42

3 Answers

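text = b'unicode error on this text'
text = text.decode('utf-8')

Try decoding the text. Note that decode exists only on byte strings; the traceback above shows that text is already a str being encoded for writing, so this on its own does not fix the error.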

Samsul Islam

I don't know why Tesseract would be returning a string containing an invalid Unicode character, but that appears to be what is going on. It is possible to tell Python to ignore encoding errors. Try changing the line that opens the output file to the following:

file1 = open(fullTempPath, "a+", errors="ignore") 
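As a rough illustration of what errors="ignore" does in this situation, the sketch below writes a made-up sample string to a temporary file (the path `out.txt` and the sample text are assumptions for this example, and cp1252 is passed explicitly so it behaves the same on any OS). Characters the encoding cannot represent are dropped silently instead of raising:

```python
import os
import tempfile

# Made-up sample containing U+FB01, which cp1252 cannot encode.
sample = "con\ufb01g"

# Hypothetical output path inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")

# errors="ignore" suppresses the UnicodeEncodeError by skipping the character.
with open(out_path, "a+", encoding="cp1252", errors="ignore") as f:
    f.write(sample + "\n")

with open(out_path, encoding="cp1252") as f:
    written = f.read()

print(repr(written))  # the ligature is missing from what was saved
```

The write succeeds, but the saved text no longer matches what the OCR step returned, which is the trade-off to keep in mind with this option.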
CryptoFool
  • Hi, Thank you for the reply. Getting an error "write() takes no keyword arguments" – GJR Jan 06 '21 at 07:44
  • Doh! I added that argument to the wrong line of your code. Sorry about that. I updated the answer to fix my boo boo. – CryptoFool Jan 06 '21 at 13:13
  • @Steve Invalid? Looks pretty valid, according to the [other answer](https://stackoverflow.com/a/65620323/5267751). – user202729 Jan 09 '21 at 08:38
  • @GayathriJayapandian You might want to use utf8 encoding as suggested in the other answer, otherwise the data in the file might be corrupted. – user202729 Jan 09 '21 at 08:38

The default file encoding used by open is the value returned by locale.getpreferredencoding(False), which on Windows is generally a legacy encoding that doesn't support all Unicode characters. In this case the error message indicates it was cp1252 (a.k.a. Windows-1252). It is best to specify the encoding you want explicitly. UTF-8 handles all Unicode characters:

file1 = open(fullTempPath, "a+", encoding='utf8')
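For context, here is a minimal sketch of how the write loop from the question might look with an explicit encoding. The helper `save_texts`, the file name `extracted.txt`, and the hard-coded `ocr_results` list are all assumptions made so the example runs without Tesseract; in the real script the text would come from pytesseract.image_to_string.

```python
import os
import tempfile

def save_texts(texts, out_path):
    """Append each (name, text) pair to out_path, encoded as UTF-8."""
    # Open the file once rather than per image; `with` closes it even on error.
    with open(out_path, "a+", encoding="utf-8") as out:
        for name, text in texts:
            out.write(name + "\n")
            out.write(text + "\n")

# Stand-in for OCR output (hypothetical data, includes the U+FB01 ligature).
ocr_results = [("page1.png", "The \ufb01rst page"), ("page2.png", "plain text")]

out_path = os.path.join(tempfile.mkdtemp(), "extracted.txt")
save_texts(ocr_results, out_path)

# Read back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as f:
    content = f.read()

print(content)
```

Reading the file back should use the same encoding argument, otherwise the reader falls back to the locale default again.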

FYI, U+FB01 is LATIN SMALL LIGATURE FI (ﬁ) if that makes any sense on the image being processed.
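One way to check this yourself is with the standard unicodedata module. The sketch below looks up the character's name, confirms that cp1252 rejects it while UTF-8 accepts it, and shows NFKC normalization as one possible way to expand the ligature into plain letters (the sample word is made up, and whether normalizing is appropriate for your OCR output is a judgment call):

```python
import unicodedata

ch = "\ufb01"

# Official Unicode name of the character from the error message.
name = unicodedata.name(ch)

# cp1252 has no slot for this character, so encoding raises.
try:
    ch.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can encode any Unicode character.
utf8_bytes = ch.encode("utf-8")

# NFKC compatibility normalization replaces the ligature with "f" + "i".
plain = unicodedata.normalize("NFKC", "con\ufb01g")

print(name, cp1252_ok, utf8_bytes, plain)
```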

Also, Windows editors tend to assume the same legacy encoding unless the file is written with utf-8-sig, which adds an encoded BOM character to the beginning of the file as a hint that it is UTF-8.
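A small sketch of the utf-8-sig behaviour, using a temporary file and a made-up sample string (both are assumptions for illustration). It writes once in "w" mode, then inspects the raw bytes to show the three-byte BOM at the start:

```python
import os
import tempfile

# Hypothetical output path inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "extracted.txt")

# utf-8-sig writes the UTF-8 encoded BOM before the first character.
with open(out_path, "w", encoding="utf-8-sig") as f:
    f.write("con\ufb01g\n")

# Inspect the raw bytes to see the BOM.
with open(out_path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(b"\xef\xbb\xbf")

# Reading with utf-8-sig strips the BOM again.
with open(out_path, encoding="utf-8-sig") as f:
    text = f.read()

print(has_bom, repr(text))
```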

Mark Tolonen