Trouble reading string with non-ascii characters in python 3

Question

I am trying to read images from the WikiArt dataset. However, I cannot load some images whose file names contain non-ASCII characters, for example 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg', although the file exists in the directory. I also compared the name returned by os.listdir() with the one in the error message, FileNotFoundError: No such file: '/wiki_art_paintings/rescaled_600px_max_side/Expressionism/fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg', by evaluating 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg' == 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'. The output is False.

What could be the problem here?

Please add your code and a proper output of the error traceback — Or Y, Dec 23 '20 at 06:40
Did you check character by character which code points the strings contain? You could write a script to do this. Maybe you have two characters which look the same but have different codes, or maybe there is a code which is not displayed on screen. — furas, Dec 23 '20 at 06:40
when I check char-by-char then it shows me `ã` as two chars `a ̃ ` - In unicode it is possible — furas, Dec 23 '20 at 06:45
How do you get these files? Maybe it could be corrected when you get the files and put them on disk. And what system do you use? Maybe the system causes the problem: I once had a problem with macOS because it was using a different form of UTF-8. — furas, Dec 23 '20 at 06:53
@furas, I have several image folders and csv file which contains some data + reference to the images in those folders as string — kilich, Dec 23 '20 at 06:58
I found only old code which tests different methods of converting macOS Unicode filenames to Linux Unicode: [macosx-linux-UTF-8](https://github.com/furas/python-examples/blob/master/decode-encode/macosx-linux/main.py). Using the function `unidecode()` I can convert both versions to the same `fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg` so they can be compared (`==` gives `True`), but that is still useless for getting the name from a file and building a filename to open the image. — furas, Dec 23 '20 at 07:07
Seems to be related to normalized/denormalized forms. Take a look at this: https://stackoverflow.com/questions/3126929/python-denormalize-unicode-combining-characters — mgruber4, Dec 23 '20 at 08:12

furas · Accepted Answer · 2020-12-23T07:58:32.977

The problem is that Unicode lets you write some characters either as a single code point or as a combination of two code points (a base letter plus a combining mark), and your two strings use different forms. In one place the character is a single code point; in the other it is a pair of code points. You can even see the difference when you use len() on both strings: in your example one version has length 53 and the other 52.
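To make the difference visible, here is a minimal sketch that lists the code points of both forms of the character. The helper name `describe` is hypothetical and only used for illustration; `unicodedata.name()` is from the standard library.

```python
import unicodedata

def describe(text):
    """Return one (character, code point, Unicode name) tuple per code point."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

decomposed = "a\u0303"   # 'a' followed by U+0303 COMBINING TILDE (two code points)
precomposed = "\u00e3"   # U+00E3 LATIN SMALL LETTER A WITH TILDE (one code point)

for label, s in (("decomposed", decomposed), ("precomposed", precomposed)):
    # Both render the same on screen, but their lengths and code points differ.
    print(label, len(s), describe(s))
```

Running the same helper on the two file names should show where they diverge, since the strings look identical when printed.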

It seems you can convert one name to the other using unicodedata.normalize() with one of the forms NFC, NFKC, NFD or NFKD, so you have to test which one works for you.

In one direction you may need a composing form (NFC or NFKC); in the other direction you may need a decomposing form (NFD or NFKD).

You can also use unidecode to create an ASCII-only version of the text, fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg, but this may not be so useful for you.

import unicodedata
from unidecode import unidecode

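# Escapes make the two forms explicit, so copy-paste cannot change them.
# a: decomposed form, 'a' + U+0303 COMBINING TILDE (then U+00A9)
a = 'fa\u0303\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
# b: precomposed form, U+00E3 (then U+00A9)
b = 'f\u00e3\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'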

print('a:', a)
print('b:', b)

print('--- len ---')
print('len(a):', len(a))
print('len(b):', len(b))

print('--- encode ---')
print('a.encode:', a.encode('utf-8'))
print('b.encode:', b.encode('utf-8'))

print('--- a == normalize(b) ---')
print('NFC: ', a == unicodedata.normalize('NFC', b) )
print('NFKC:', a == unicodedata.normalize('NFKC', b) )
print('NFD: ', a == unicodedata.normalize('NFD', b) )
print('NFKD:', a == unicodedata.normalize('NFKD', b) )

print('--- b == normalize(a) ---')
print('NFC: ', b == unicodedata.normalize('NFC', a) )
print('NFKC:', b == unicodedata.normalize('NFKC', a) )
print('NFD: ', b == unicodedata.normalize('NFD', a) )
print('NFKD:', b == unicodedata.normalize('NFKD', a) )

print('--- unidecode ---')
print('a:', unidecode(a))
print('b:', unidecode(b))

Result:

a: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
b: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
--- len ---
len(a): 53
len(b): 52
--- encode ---
a.encode: b'fa\xcc\x83\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
b.encode: b'f\xc3\xa3\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
--- a == normalize(b) ---
NFC:  False
NFKC: False
NFD:  True
NFKD: True
--- b == normalize(a) ---
NFC:  True
NFKC: True
NFD:  False
NFKD: False
--- unidecode ---
a: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
b: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg

I have only come across characters written as a combination of two other characters when I had to transfer macOS files to another system.
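As a rough sketch of how normalization could be applied when opening the images, the function below looks up the name actually stored on disk by comparing normalized forms. The name `resolve_filename` and its `form` parameter are hypothetical, not something from the question, and NFC is only an assumed default; whichever form you pick, apply it to both sides of the comparison.

```python
import os
import unicodedata

def resolve_filename(directory, wanted, form="NFC"):
    """Return the on-disk name in `directory` that equals `wanted`
    once both are normalized to `form`, or None if nothing matches."""
    target = unicodedata.normalize(form, wanted)
    for name in os.listdir(directory):
        # Normalize the name from disk too, so either form on disk matches.
        if unicodedata.normalize(form, name) == target:
            return name
    return None
```

The returned name can then be joined with the directory via os.path.join() and passed to whatever loads the image, so the path uses exactly the form the filesystem stores.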

Doc: unicodedata

Pythonsheet: Unicode

Stackoverflow: Normalizing Unicode

thank you, @furas. Simply introducing normalization helped me to resolve it. — kilich, Dec 23 '20 at 12:23

score -1 · Answer 2 · answered Dec 23 '20 at 07:00

The two strings are not the same. Look:

> ciao='fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'.encode('utf-8')       
> bye='fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'.encode('utf-8')        
> ciao.hex() 
 '6661cc83c2a96c69782d64656c2d6d61726c655f6e752d6167656e6f75696c6c2d7375722d666f6e642d626c65752d313933372e6a7067'
> bye.hex()  
 '66c3a3c2a96c69782d64656c2d6d61726c655f6e752d6167656e6f75696c6c2d7375722d666f6e642d626c65752d313933372e6a7067'
> ciao2='fa'.encode('utf-8')
> bye2='f'.encode('utf-8')
> ciao2.hex()
 '6661'
> bye2.hex() 
 '66'

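It seems there is a hidden character right after the 'f', and it appears to be an 'a': the first string contains the bytes 61 cc 83 (an 'a' followed by a combining tilde), while the second contains c3 a3 (a single 'ã').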

Lews

Nice description, but this is a place for solutions and I don't see any solution in your answer – furas Dec 23 '20 at 07:00
The problem is that your filename contains é. The process that created the file wrote out the name in UTF-8 which needs 2 bytes to represent é. Your filesystem doesn't understand UTF-8 so it is displaying the 2 bytes as if it were encoded as latin-1. Try putting é in the filename in your call to `open()` instead of ã©. – BoarGules Dec 23 '20 at 08:06
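The hypothesis in the last comment can be illustrated with a short sketch. It only shows that UTF-8 bytes for é, when read as latin-1 and lowercased, produce the 'ã©' seen in the file name; it does not confirm what actually happened to the dataset. The variable names are made up for the example.

```python
original = "é"

# Encode as UTF-8 (two bytes: c3 a9), then misread those bytes as latin-1.
misread = original.encode("utf-8").decode("latin-1")
print(misread)          # two characters: U+00C3 and U+00A9
print(misread.lower())  # U+00E3 and U+00A9, the pair seen in the file name

# Reversing the misreading recovers the original character,
# provided the text was not lowercased in between.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == original)
```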
