I have some files on my Linux system whose names may be encoded in something other than un_eng-utf8. I want to convert these names from their non-UTF-8 encoding to UTF-8. How can I do that using a C library function or a Python script?
Do you know the original encoding? Are all characters available in UTF-8, or do you want to replace the unavailable ones with similar ones, ignore them, or do something else about them? – TobiMarg Oct 13 '15 at 10:35
2 Answers
If you know the character encoding that is used to encode the filenames:
unicode_filename = bytestring_filename.decode(character_encoding)
utf8filename = unicode_filename.encode('utf-8')
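The two lines above are Python 2 style. As a rough Python 3 sketch of the same idea applied to a whole directory, assuming the source encoding is known and is the same for every file: the helper names `to_utf8_name` and `rename_to_utf8` are made up for illustration, and passing a bytes path to `os.listdir` is what makes it return the raw, undecoded names.

```python
import os


def to_utf8_name(raw_name: bytes, source_encoding: str) -> bytes:
    """Re-encode a raw filename from source_encoding to UTF-8."""
    return raw_name.decode(source_encoding).encode("utf-8")


def rename_to_utf8(directory: str, source_encoding: str) -> None:
    """Rename entries in directory whose names are not valid UTF-8.

    Assumes every non-UTF-8 name uses source_encoding. Note that
    os.rename may silently replace an existing file with the new name.
    """
    dir_bytes = os.fsencode(directory)
    for raw_name in os.listdir(dir_bytes):  # bytes in, bytes out
        try:
            raw_name.decode("utf-8")
            continue  # already valid UTF-8, leave it alone
        except UnicodeDecodeError:
            pass
        new_name = to_utf8_name(raw_name, source_encoding)
        os.rename(
            os.path.join(dir_bytes, raw_name),
            os.path.join(dir_bytes, new_name),
        )
```

For example, `rename_to_utf8("/some/dir", "cp1252")` would rename only the entries whose names fail to decode as UTF-8; the directory path here is a placeholder.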
If you don't know the character encoding, then in the general case there is no way to do the conversion without losing data. "Non-utf8" is not specific enough: for example, a filename that contains the byte b'\xae' can be interpreted differently depending on the filename encoding. It is u'®' in cp1252, but the same byte represents u'«' in cp437. There are modules such as chardet that let you guess the character encoding, but it is only a guess: "There Ain't No Such Thing as Plain Text."
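The ambiguity is easy to reproduce. The following Python 3 sketch decodes the same byte under several candidate encodings; the function name `decode_candidates` is made up for illustration, and `None` marks an encoding in which the bytes are not valid at all.

```python
def decode_candidates(raw: bytes, encodings):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None  # raw is not valid in this encoding
    return results


if __name__ == "__main__":
    for name, text in decode_candidates(b"\xae", ["cp1252", "cp437", "utf-8"]).items():
        print(name, repr(text))
```

Both cp1252 and cp437 decode the byte without error, so nothing in the byte itself says which reading is the intended one.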

def converttoutf8(a):
    return unicode(a, "utf-8")
Now, for every filename you iterate over, this returns the UTF-8 formatted filename.
Or, even better, use convmv. It converts filenames from one encoding to another and takes a directory as an argument. Sounds perfect.

Your code assumes that the bytestring `a` is encoded using utf-8. The OP explicitly says that the input is not utf-8. – jfs Oct 13 '15 at 11:09