
Is there a way for Python to detect garbled or broken characters such as the ones below? I am reading a file as UTF-8, but these characters were not detected as invalid.

従業員マスタ 
���������� 
Fancy
  • Can you clarify what you expected to happen? These characters are not invalid. – MisterMiyagi Aug 31 '21 at 08:43
  • These are valid chars. – Zazaeil Aug 31 '21 at 08:46
  • I am actually trying to import data that is Japanese, Chinese and English. There are instances where it gets garbled, and I wanted the program to error out if it encounters that in the file. – Fancy Aug 31 '21 at 08:47
  • Then you have to filter them out yourself I guess. – Zazaeil Aug 31 '21 at 08:48
  • Please post the text as a _binary_ string, e.g. `print(open(filename, 'rb').read())`. Also please post the code you are using to read the file. – Thomas Aug 31 '21 at 08:55
  • See https://stackoverflow.com/questions/24140497/unbaking-mojibake and https://github.com/rspeer/python-ftfy – Peter Wood Aug 31 '21 at 12:38

2 Answers


Unicode is a big code space that's blocked out into subsections agreed upon to represent certain sets of characters; UTF-8 is one common way of encoding that space.

Font designers might decide to (or not to) assign glyphs to characters in any of the 1,114,112 available slots in that space, but tend to serve only the most common blocks - or, for specialist fonts, specific blocks that serve a specific purpose, like Egyptian hieroglyphs for example. None of those characters are "wrong" or "garbled"; they just might not have a defined picture in your chosen font.
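For what it's worth, the standard library's unicodedata module can tell you what a code point actually is, independent of whether your font can draw it. A quick sketch, using one character from the question's sample plus the replacement character the garbled run appears to contain:

```python
import unicodedata

# Print each code point and its official Unicode name; the second
# argument to unicodedata.name() is a fallback for unnamed code points.
for ch in "従\ufffd":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))

# U+5F93 CJK UNIFIED IDEOGRAPH-5F93
# U+FFFD REPLACEMENT CHARACTER
```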

Similarly, the characters you identify will have a meaning somewhere - or, if they are a result of encoding errors, it's likely to be due to characters being written in one encoding and then sloppily re-encoded into another one.

A useful tool for dealing with common cases like this is the ftfy module, which scoops up text that looks problematic and returns cleaned text with those encoding problems sorted out.
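A minimal sketch of how ftfy is typically used (the mojibake string here is an illustrative example, not the asker's data):

```python
# pip install ftfy
import ftfy

# "séance" that was encoded as UTF-8 and then wrongly decoded as Latin-1
broken = "sÃ©ance"
print(ftfy.fix_text(broken))  # -> "séance"
```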

(Or, as I've just read in the comments, they might relate to another language's block, like Japanese kanji or Chinese characters - these will occupy a fairly specific block range1 that you can use to identify characters belonging to Chinese, Japanese or Korean text. There are methods to convert the characters into their numeric values, which you can then check against the ranges you expect - see the sketch below.)

1 From U+4E00 to U+9FFF inclusive (the CJK Unified Ideographs block) by the looks of things.
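A quick sketch of that check, using the built-in ord() and the block range from the footnote (the sample string is taken from the question):

```python
# ord() gives a character's numeric code point, which you can compare
# against the CJK Unified Ideographs block (U+4E00..U+9FFF).
def is_cjk_ideograph(ch):
    return 0x4E00 <= ord(ch) <= 0x9FFF

for ch in "従業員マスタ":
    print(ch, f"U+{ord(ch):04X}", is_cjk_ideograph(ch))
# 従, 業, 員 are ideographs (True); マ, ス, タ are katakana (False)
```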

Thomas Kimber

Your comment suggests you expect pretty specific sets of characters, so I would attempt to specify ranges of Unicode characters. You can use them to filter out any characters outside of those ranges and handle them accordingly, e.g. delete them or raise an exception.

Here is a suggestion on how to implement the filtering.
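A minimal sketch of such a filter, assuming an illustrative set of allowed ranges (printable ASCII, kana, and CJK ideographs - adjust these to whatever your data should actually contain):

```python
ALLOWED_RANGES = [
    (0x0020, 0x007E),  # printable ASCII (English)
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]

def is_allowed(ch):
    return any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES)

def validate(text):
    """Raise on the first character outside the allowed ranges."""
    for i, ch in enumerate(text):
        if not is_allowed(ch):
            raise ValueError(
                f"unexpected character {ch!r} (U+{ord(ch):04X}) at index {i}")
    return text

def strip_unexpected(text):
    """Or silently drop anything outside the allowed ranges instead."""
    return "".join(ch for ch in text if is_allowed(ch))
```

validate("従業員マスタ") passes, while a string containing U+FFFD (the replacement character that decoding errors typically produce) raises.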

Here are the Unicode ranges for Chinese/Japanese/Korean characters.
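For reference, some of the commonly cited CJK-related blocks (not an exhaustive list):

  • U+3040–U+309F Hiragana
  • U+30A0–U+30FF Katakana
  • U+3400–U+4DBF CJK Unified Ideographs Extension A
  • U+4E00–U+9FFF CJK Unified Ideographs
  • U+AC00–U+D7AF Hangul Syllables
  • U+F900–U+FAFF CJK Compatibility Ideographs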

Jotha