
How can I check whether every character on each line of a file is valid UTF-8? Lines that contain invalid UTF-8 should be left out.

Here is my code that is not working:

lines = [
"correct UTF-8 text: Here come the tests", 
"correct UTF-8 text: You should see the Greek word 'kosme':'κόσμε'", 
"not utf-8: U+FDD0 .. U+FDEF = '﷐﷑﷒﷓﷔﷕﷖﷗﷘﷙﷚﷛﷜﷝﷞﷟﷠﷡﷢﷣﷤﷥﷦﷧﷨﷩﷪﷫﷬﷭﷮﷯'", 
"not utf-8: 6 bytes (U-7FFFFFFF): '������' ",
"ăѣծềſģȟᎥǩľḿꞑȯȶψ1234567890!$%^&*()-_=+[{]};:',<.>/?~Ḇ٤ḞԍНǏƘԸⲘ০ΡɌȚЦѠƳȤѧᖯćễႹļṃʼnоᵲꜱừŵź1234567890!@#$%^&*()-_=+[{]};:,<.>/?~АḂⲤꞠꓧȊꓡǬŖꓫŸảƀḋếᵮℊᎥкιṃդⱺŧṽẉყž1234567890!@#$%^&*()-_=+[{]};:',<.>/?~ѦƇᗞΣℱԍҤ١КƝȎṚṮṺƲᏔꓫᏏçძḧҝɭḿṛтúẃ⤬1234567890!@#$%^&*()-_=+[{]};:',<.>/?~ΒĢȞỈꓗʟℕ০ՀꓢṰǓⅤⲬ§"
]

for line in lines:
    try:
        print(line)
    except UnicodeDecodeError:
        print("UnicodeDecodeError: " + line)
        pass

Not all lines should be printed. What is wrong with my code?


If I take a file (I saved this page as file.txt: https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html) and try to check utf-8 there I get an UnicodeDecodeError:

File "test.py", line 13, in <module>
    for line in file:
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 240: character maps to <undefined>

The code:

file = open("file.txt", "r")

for line in file:
    line = line.strip()
    try:
        print(line)
    except UnicodeDecodeError:
        print("UnicodeDecodeError " + line)
        pass

Shouldn't the except clause and pass ensure that, if a UnicodeDecodeError occurs, it is ignored and the script continues with the next line?

  • "�" *is* a valid UTF-8 character, otherwise you couldn't show it here in the question. What exact *bytes* does your file contain…? – deceze Mar 17 '20 at 09:21
  • ```print("not utf-8: 6 bytes (U-7FFFFFFF): '������' ")``` – Aven Desta Mar 17 '20 at 09:26
  • I took the file from this question: https://stackoverflow.com/questions/1301402/example-invalid-utf8-string (Markus Kuhn's UTF-8 decoder capability and stress test file). All lines from this file are also printed. – annamaria07 Mar 17 '20 at 09:27
  • Do you find the code okay? – annamaria07 Mar 17 '20 at 09:29
  • Did you copy and paste the contents of that file into your own file? Then you haven't copied those *problems* with it. You'll need to use that file as is, or generate an actual file that contains those problematic *byte sequences*, at a lower level than you can do with a regular text editor. – deceze Mar 17 '20 at 09:31
  • Dude, you need to store the malformed UTF-8 code unit sequence as `bytes`. Then try to `.decode("utf-8")` it. All the strings in your `lines` list are actually valid UTF-8 strings. – GordonAitchJay Mar 17 '20 at 09:33
  • The fact that you can't **see** a character doesn't mean it is not UTF-8! Your environment may be unable to draw it because the font lacks the glyph, or similar. Thus your strings may all contain valid UTF-8 encodings even though your IDE, console, etc. cannot show them... � is the most common symptom of that situation... – Jean-Baptiste Yunès Mar 17 '20 at 09:48

2 Answers


As the lines in your example are str literals, they are already valid Unicode strings.

Please try to replace lines with raw bytes like this:

lines = [
    b"valid utf-8",
    b"not utf-8: U+FDD0 .. U+FDEF = '\xFD\xD0\xFD\xEF'",
]

for line in lines:
    try:
        print(line.decode())
    except UnicodeDecodeError:
        print('Invalid UTF-8: {}'.format(line))

Note the explicit decode call in a loop.
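For the original goal of filtering out bad lines, a minimal sketch along the same idea (the sample byte strings below are my own, not the question's exact data): decode each line of bytes yourself and drop the ones that fail. When reading from a file, open it in binary mode (e.g. `open("file.txt", "rb")`) so Python does not pre-decode with the platform default such as cp1252.

```python
raw_lines = [
    b"correct UTF-8 text: Here come the tests",
    "correct UTF-8 text: the Greek word 'kosme': '\u03ba\u03cc\u03c3\u03bc\u03b5'".encode("utf-8"),
    b"not utf-8: invalid lead bytes: '\xfd\xd0\xfd\xef'",  # 0xFD can never start a UTF-8 sequence
]

valid_lines = []
for raw in raw_lines:
    try:
        valid_lines.append(raw.decode("utf-8"))
    except UnicodeDecodeError:
        pass  # drop lines whose bytes are not valid UTF-8

print(valid_lines)  # the third line is filtered out
```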

leporo

The problem is that you copied and pasted the displayed text. As I never fully trust what a screen shows, I checked the actual content of the last two strings with `[(i, hex(ord(i))) for i in t]`.

The 4th one gives:

[('n', '0x6e'), ('o', '0x6f'), ('t', '0x74'), (' ', '0x20'), ('u', '0x75'), ('t', '0x74'), ('f', '0x66'), ('-', '0x2d'), ('8', '0x38'), (':', '0x3a'), (' ', '0x20'), ('6', '0x36'), (' ', '0x20'), ('b', '0x62'), ('y', '0x79'), ('t', '0x74'), ('e', '0x65'), ('s', '0x73'), (' ', '0x20'), ('(', '0x28'), ('U', '0x55'), ('-', '0x2d'), ('7', '0x37'), ('F', '0x46'), ('F', '0x46'), ('F', '0x46'), ('F', '0x46'), ('F', '0x46'), ('F', '0x46'), ('F', '0x46'), (')', '0x29'), (':', '0x3a'), (' ', '0x20'), ("'", '0x27'), ('�', '0xfffd'), ('�', '0xfffd'), ('�', '0xfffd'), ('�', '0xfffd'), ('�', '0xfffd'), ('�', '0xfffd'), ("'", '0x27'), (' ', '0x20')]

All offending bytes have been replaced by the REPLACEMENT CHARACTER U+FFFD, which is a perfectly valid Unicode character. So this is a correct string.

The 3rd one gives:

[('n', '0x6e'), ('o', '0x6f'), ('t', '0x74'), (' ', '0x20'), ('u', '0x75'), ('t', '0x74'), ('f', '0x66'), ('-', '0x2d'), ('8', '0x38'), (':', '0x3a'), (' ', '0x20'), ('U', '0x55'), ('+', '0x2b'), ('F', '0x46'), ('D', '0x44'), ('D', '0x44'), ('0', '0x30'), (' ', '0x20'), ('.', '0x2e'), ('.', '0x2e'), (' ', '0x20'), ('U', '0x55'), ('+', '0x2b'), ('F', '0x46'), ('D', '0x44'), ('E', '0x45'), ('F', '0x46'), (' ', '0x20'), ('=', '0x3d'), (' ', '0x20'), ("'", '0x27'), ('\ufdd0', '0xfdd0'), ('\ufdd1', '0xfdd1'), ('\ufdd2', '0xfdd2'), ('\ufdd3', '0xfdd3'), ('\ufdd4', '0xfdd4'), ('\ufdd5', '0xfdd5'), ('\ufdd6', '0xfdd6'), ('\ufdd7', '0xfdd7'), ('\ufdd8', '0xfdd8'), ('\ufdd9', '0xfdd9'), ('\ufdda', '0xfdda'), ('\ufddb', '0xfddb'), ('\ufddc', '0xfddc'), ('\ufddd', '0xfddd'), ('\ufdde', '0xfdde'), ('\ufddf', '0xfddf'), ('\ufde0', '0xfde0'), ('\ufde1', '0xfde1'), ('\ufde2', '0xfde2'), ('\ufde3', '0xfde3'), ('\ufde4', '0xfde4'), ('\ufde5', '0xfde5'), ('\ufde6', '0xfde6'), ('\ufde7', '0xfde7'), ('\ufde8', '0xfde8'), ('\ufde9', '0xfde9'), ('\ufdea', '0xfdea'), ('\ufdeb', '0xfdeb'), ('\ufdec', '0xfdec'), ('\ufded', '0xfded'), ('\ufdee', '0xfdee'), ('\ufdef', '0xfdef'), ("'", '0x27')]

This one is a better test: the code points between U+FDD0 and U+FDEF are indeed noncharacters, not intended for interchange. What happens here is that Python just does not care and happily processes them.
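You can see this directly: a quick sketch showing that Python's UTF-8 codec round-trips the noncharacter U+FDD0 without complaint.

```python
# U+FDD0 is a Unicode noncharacter, yet Python's UTF-8 codec
# encodes and decodes it without raising any error.
s = "\ufdd0"
b = s.encode("utf-8")
print(b)                       # b'\xef\xb7\x90'
assert b.decode("utf-8") == s  # round-trips cleanly
```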

If you want to verify that they are truly valid Unicode characters, you can use the unicodedata module. Its name function gives the official name of every known valid character and raises a ValueError for those that are not:

>>> unicodedata.name('�')
'REPLACEMENT CHARACTER'
>>> unicodedata.name('\uFFFD')
'REPLACEMENT CHARACTER'
>>> unicodedata.name('\uFDD0')
Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    unicodedata.name('\uFDD0')
ValueError: no such name
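Applied to the question's list, one possible sketch (the helper name is my own): keep only lines in which every character has an official name. Note this is stricter than "valid UTF-8" — ordinary control characters such as '\n' also have no name and would be rejected.

```python
import unicodedata

def line_is_clean(line):
    """Return True if every character has an official Unicode name.
    Noncharacters like U+FDD0 and unassigned code points do not."""
    for ch in line:
        try:
            unicodedata.name(ch)
        except ValueError:
            return False
    return True

lines = ["plain ASCII", "noncharacters: \ufdd0\ufdd1"]
print([l for l in lines if line_is_clean(l)])  # only "plain ASCII" survives
```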

But if you want to test whether bytes represent valid UTF-8, then you should use bytes and not a Unicode string.

Serge Ballesta