How can I check whether every character in each line of a file is valid UTF-8? Lines that contain invalid UTF-8 should be left out.
Here is my code, which is not working:
lines = [
    "correct UTF-8 text: Here come the tests",
    "correct UTF-8 text: You should see the Greek word 'kosme':'κόσμε'",
    "not utf-8: U+FDD0 .. U+FDEF = ''",
    "not utf-8: 6 bytes (U-7FFFFFFF): '������' ",
    "ăѣծềſģȟᎥǩľḿꞑȯȶψ1234567890!$%^&*()-_=+[{]};:',<.>/?~Ḇ٤ḞԍНǏƘԸⲘ০ΡɌȚЦѠƳȤѧᖯćễႹļṃʼnоᵲꜱừŵź1234567890!@#$%^&*()-_=+[{]};:,<.>/?~АḂⲤꞠꓧȊꓡǬŖꓫŸảƀḋếᵮℊᎥкιṃդⱺŧṽẉყž1234567890!@#$%^&*()-_=+[{]};:',<.>/?~ѦƇᗞΣℱԍҤ١КƝȎṚṮṺƲᏔꓫᏏçძḧҝɭḿṛтúẃ⤬1234567890!@#$%^&*()-_=+[{]};:',<.>/?~ΒĢȞỈꓗʟℕ০ՀꓢṰǓⅤⲬ§"
]
for line in lines:
    try:
        print(line)
    except UnicodeDecodeError:
        print("UnicodeDecodeError: " + line)
        pass
Not all of these lines should be printed, yet all of them are. What is wrong with my code?
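To make the goal clearer, this is roughly the kind of per-line check I have in mind, done on raw bytes instead of the already-built strings (just a sketch; the byte strings below are made up for illustration, and the second one is deliberately invalid UTF-8):

# Sketch of the check I have in mind, on bytes rather than str.
# The sample byte strings are made up; the second one is intentionally
# invalid UTF-8 (the byte 0xFF can never appear in a valid UTF-8 sequence).
byte_lines = [
    "correct UTF-8 text: 'κόσμε'".encode("utf-8"),
    b"not utf-8: \xff\xfe\xfd",
]
for raw in byte_lines:
    try:
        print(raw.decode("utf-8"))    # strict decode raises on invalid bytes
    except UnicodeDecodeError:
        pass                          # leave the invalid line out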
If I use an actual file (I saved this page as file.txt: https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html) and try to check for UTF-8 there, I get a UnicodeDecodeError:
File "test.py", line 13, in for line in file: File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 240: character maps to undefined
The code:
file = open("file.txt", "r")
for line in file:
    line = line.strip()
    try:
        print(line)
    except UnicodeDecodeError:
        print("UnicodeDecodeError " + line)
        pass
Shouldn't the except clause and the pass ensure that, if a UnicodeDecodeError occurs, it is ignored and the script continues with the next line?
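For reference, this is roughly what I think I need to end up with: open the file in binary mode so that no implicit decoding happens, and decode each line as UTF-8 myself, skipping lines that fail (just a sketch, assuming the same file.txt as above). Is that the right approach?

# Sketch: read raw bytes so no implicit decoding (and no cp1252) happens,
# then decode each line explicitly as UTF-8 and skip lines that fail.
with open("file.txt", "rb") as file:
    for raw_line in file:
        try:
            line = raw_line.decode("utf-8").strip()
        except UnicodeDecodeError:
            continue                  # leave out lines with invalid UTF-8
        print(line)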