Check for valid utf8 string in Python

Question

I'm reading filenames from the file system and I want to send them as a JSON-encoded array. The problem is that filenames on the file system can be stored in an invalid encoding, and I need to handle this by omitting the invalid filenames before passing the list to json.dump; otherwise it will fail.

Is there a way to check that my string (filename) contains valid utf-8 chars?

Shock me. *Why* would the files not have valid UTF-8 filenames? — Ignacio Vazquez-Abrams, Mar 10 '11 at 11:43
Is it the file name that is not encoded in UTF-8, or is it the data in the file? I'm confused. — mouad, Mar 10 '11 at 11:45
How about buggy software that creates filenames based on ID3 tags without checking the encoding? Or mounting (with the wrong options) an old filesystem that uses an odd character encoding for filenames? — Mark Longair, Mar 10 '11 at 11:47
Invalid encoding can be a big problem when moving data from old (non-UTF-8) systems (like WinXP with a non-US/EN locale), and especially with files in .zip and .rar archives created on those systems — troex, Mar 10 '11 at 12:48
@IgnacioVazquez-Abrams because the filename/filesystem could be corrupted. — styrofoam fly, Jan 04 '18 at 16:24

score 19 · Accepted Answer · edited May 23 '17 at 10:34

How about trying the following?

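# filename must be a byte string (a Python 2 str, or bytes in Python 3)
valid_utf8 = True
try:
    filename.decode('utf-8')
except UnicodeDecodeError:
    # the bytes are not a well-formed UTF-8 sequence
    valid_utf8 = False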

... based on an answer to a similar question here: How to write a check in python to see if file is valid UTF-8?
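
For the use case in the question (dropping bad names before JSON encoding), the same check might be wrapped up roughly as follows. This is only a sketch and assumes Python 3, where passing a bytes path to os.listdir returns the raw bytes names; the helper names is_valid_utf8, utf8_filenames and filenames_as_json are made up for illustration and are not from the original answer.

```python
import json
import os


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def utf8_filenames(directory: bytes = b".") -> list:
    """List the names in directory, skipping any that are not valid UTF-8."""
    # A bytes path makes os.listdir return the names as raw bytes,
    # so nothing has been decoded (or silently repaired) before the check.
    return [
        name.decode("utf-8")
        for name in os.listdir(directory)
        if is_valid_utf8(name)
    ]


def filenames_as_json(directory: bytes = b".") -> str:
    """Serialize the valid names as a JSON array (hypothetical helper)."""
    return json.dumps(utf8_filenames(directory))
```

Calling filenames_as_json(b"/some/dir") should then give a JSON array that contains only the names that decoded successfully; the invalid ones are omitted rather than reported.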

answered Mar 10 '11 at 11:41

Mark Longair

if isinstance(filename, unicode): print "unicode string" – mithuntnt Nov 22 '13 at 16:14
@mithuntnt: the question isn't asking about whether a Python string is a `unicode`; it's asking whether the bytes that make up a filename in the filesystem are valid UTF-8. – Mark Longair Nov 22 '13 at 18:16
This will not catch strings that contain high/low surrogates (U+D800 to U+DFFF). – awm Jul 22 '15 at 06:38
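
A note on the surrogate point, hedged: as far as I can tell it applies to Python 2, whose UTF-8 decoder was lenient about encoded surrogates. Python 3's strict decoder rejects them. Python 3 also hands undecodable filename bytes back as lone surrogates in str names (the surrogateescape error handler), so a str name can be checked by trying to encode it. A minimal sketch, with function names invented for illustration:

```python
def bytes_are_strict_utf8(raw: bytes) -> bool:
    """Python 3 strict decoding; encoded surrogates are rejected here."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def str_name_is_utf8_clean(name: str) -> bool:
    """A str with lone surrogates cannot be encoded back to UTF-8."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# U+D800 written out as if it were an ordinary three-byte sequence.
encoded_surrogate = b"\xed\xa0\x80"

# What a Latin-1 encoded name looks like after surrogateescape decoding,
# which is the handler Python 3 uses for filenames on POSIX systems.
escaped_name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
```

Here bytes_are_strict_utf8(encoded_surrogate) and str_name_is_utf8_clean(escaped_name) should both be False, so either check would filter such a name out before it reaches json.dump.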
