I'm reading filenames from the file system and want to send them as a JSON-encoded array. The problem is that filenames on the file system can be stored in an invalid encoding, so I need to detect and omit the invalid ones before passing the list to json.dump; otherwise it will fail.

Is there a way to check that my string (filename) contains only valid UTF-8?

troex
  • Shock me. *Why* would the files not have valid UTF-8 filenames? – Ignacio Vazquez-Abrams Mar 10 '11 at 11:43
  • Is it the filename that is not encoded in UTF-8, or the data in the file? I'm confused. – mouad Mar 10 '11 at 11:45
  • How about buggy software that creates filenames based on ID3 tags without checking the encoding? Or mounting (with the wrong options) an old filesystem that uses an odd character encoding for filenames? – Mark Longair Mar 10 '11 at 11:47
  • Invalid encoding can be a big problem when moving data from old (non-UTF-8) systems (like WinXP with a non-US/EN locale), especially files in .zip and .rar archives created on those systems. – troex Mar 10 '11 at 12:48
  • @IgnacioVazquez-Abrams because the filename/filesystem could be corrupted. – styrofoam fly Jan 04 '18 at 16:24

1 Answer

How about trying the following?

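# Assumes `filename` is a byte string (Python 2 `str` or Python 3 `bytes`).
valid_utf8 = True
try:
    filename.decode('utf-8')  # raises if the bytes are not valid UTF-8
except UnicodeDecodeError:
    valid_utf8 = False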

This is based on an answer to a similar question: "How to write a check in python to see if file is valid UTF-8?"
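For the use case in the question (filtering a directory listing before JSON encoding), the same check could be wrapped up roughly as follows. This is only a sketch for Python 3: the helper names `is_valid_utf8` and `utf8_filenames_as_json` are made up for illustration, and it assumes you list the directory as bytes so that you see the raw names exactly as stored on disk.

```python
import json
import os


def is_valid_utf8(name):
    """Return True if `name` (bytes or str) can round-trip as strict UTF-8.

    Hypothetical helper, not part of any library.
    """
    try:
        if isinstance(name, bytes):
            name.decode('utf-8')   # raw bytes from the file system
        else:
            name.encode('utf-8')   # str; fails on surrogate-escaped bytes
    except UnicodeError:           # covers both decode and encode errors
        return False
    return True


def utf8_filenames_as_json(directory):
    """List `directory` as bytes, drop invalid names, return a JSON array."""
    raw_names = os.listdir(os.fsencode(directory))  # bytes in, bytes out
    valid = [n.decode('utf-8') for n in raw_names if is_valid_utf8(n)]
    return json.dumps(sorted(valid))
```

Calling `utf8_filenames_as_json('.')` should then give a JSON array containing only the names that decode cleanly. In Python 3 the strict UTF-8 codec also rejects encoded surrogates (U+D800 to U+DFFF), so those names are filtered out too.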

Mark Longair
  • if isinstance(filename, unicode): print "unicode string" – mithuntnt Nov 22 '13 at 16:14
  • @mithuntnt: the question isn't asking about whether a Python string is a `unicode`; it's asking whether the bytes that make up a filename in the filesystem are valid UTF-8. – Mark Longair Nov 22 '13 at 18:16
  • This will not catch strings that contain high/low surrogates (U+D800 to U+DFFF). – awm Jul 22 '15 at 06:38