Is there a way for Python to detect garbled or broken characters such as below? I am reading a file as UTF-8, but these characters were not detected as invalid.
従æ¥å“¡ãƒžã‚¹ã‚¿
����������
UTF-8 is an encoding of Unicode, a big address space that's blocked out into subsections that are agreed upon to represent certain sets of characters.
Font designers might decide to (or not to) assign glyphs to characters at any of the 1,114,112 available code points in that space, but tend to serve only the most common blocks - or, for specialist fonts, specific blocks that serve a specific purpose, like Egyptian hieroglyphs for example. None of those characters are "wrong" or "garbled"; they just might not have a defined picture in your chosen font.
Similarly, the characters you see will have a meaning somewhere - or, if they are the result of encoding errors, it's likely because text was written in one encoding and then sloppily re-decoded as another one (so-called mojibake).
A useful tool for common cases like this is the ftfy module, which takes text that looks problematic and returns cleaned text with those encoding problems sorted out.
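ftfy handles many variants of this automatically, but the core idea can be sketched in plain Python: your sample `従æ¥å“¡ãƒžã‚¹ã‚¿` looks like UTF-8 bytes that were mis-decoded as cp1252, and reversing that round trip recovers the original text. (The function name and the choice of cp1252 are illustrative assumptions, not part of ftfy's API.)

```python
# Minimal sketch of the idea behind ftfy: mojibake often comes from
# UTF-8 bytes being mis-decoded as a single-byte encoding (e.g. cp1252).
# Encoding back to that encoding and decoding as UTF-8 undoes the damage.

def repair_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Try to undo a UTF-8-read-as-cp1252 round trip; return input unchanged on failure."""
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake of this particular kind

# "ãƒž" is what the katakana "マ" looks like after such a mis-decode
print(repair_mojibake("ãƒž"))  # -> マ
print(repair_mojibake("hello"))  # plain ASCII passes through unchanged
```

Note that your second sample (`����������`) is different: those are U+FFFD REPLACEMENT CHARACTER, which Python inserts when you decode with `errors="replace"` - so the original bytes are already gone, and the best you can do is detect them with `"\ufffd" in text`.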
(Or, as I've just read in the comments, they might belong to another language block, like Japanese kanji or Chinese characters - these occupy a fairly specific block range1 that you can use to identify characters belonging to Chinese, Japanese or Korean text hosted within Unicode. There are methods to convert characters into their numeric code points, which you can then check against the ranges you suspect you have to filter out.)
1 From U+4E00 to U+9FFF inclusive (CJK Unified Ideographs) by the looks of things.
Your comment suggests you have pretty specific sets of characters that you expect, so I would specify ranges of Unicode characters. You can use them to filter out any characters outside those ranges and handle them accordingly, e.g. delete them or raise an exception.
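Here's one way that filtering could look. The allowed ranges below are illustrative assumptions (ASCII plus CJK Unified Ideographs); extend the list to match the character sets you actually expect:

```python
# Sketch of range-based filtering: ord() gives a character's code point,
# which we check against a whitelist of Unicode ranges.

ALLOWED_RANGES = [
    (0x0000, 0x007F),  # basic ASCII (assumption - adjust to your data)
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]

def is_allowed(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)

def strip_unexpected(text: str) -> str:
    """Delete characters outside the allowed ranges (or raise here instead)."""
    return "".join(ch for ch in text if is_allowed(ch))

# U+FFFD (the "�" from the question) is outside every allowed range, so it is dropped
print(strip_unexpected("従\ufffd業x員"))  # -> 従業x員
```

Instead of deleting, you could raise a `ValueError` inside the loop if rejecting the whole input is more appropriate for your pipeline.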
Here is a suggestion on how to implement the filtering.
Here are the Unicode ranges for Chinese/Japanese/Korean characters.
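For quick reference, a few of the well-known Unicode blocks for CJK text can be tabulated directly in code (this list is non-exhaustive - there are further blocks such as CJK Extension A and Halfwidth/Fullwidth Forms):

```python
# A few well-known Unicode blocks for CJK text (non-exhaustive).
CJK_BLOCKS = {
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
    "Hiragana":               (0x3040, 0x309F),
    "Katakana":               (0x30A0, 0x30FF),
    "Hangul Syllables":       (0xAC00, 0xD7AF),
}

def block_of(ch: str):
    """Return the name of the CJK block a character falls in, or None."""
    cp = ord(ch)
    for name, (lo, hi) in CJK_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None

print(block_of("マ"))  # -> Katakana
print(block_of("a"))   # -> None
```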