
How do I convert this string:

"\xa0かかわらず"

to this string?:

"かかわらず"

In other words, how do I remove non-alphanumeric Unicode characters? I tried the approach that encodes the string as ASCII, but that does not work for Japanese text.
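The question does not say exactly which ASCII-based solution was tried. As a minimal sketch, assuming it was the common `encode("ascii", "ignore")` idiom, this shows why it cannot work here: every Japanese character is outside ASCII, so it is dropped along with the `\xa0`.

```python
text = "\xa0かかわらず"

# Assumed idiom: encode to ASCII and silently drop anything that does not fit.
# Hiragana are not ASCII, so they are discarded together with the \xa0.
ascii_only = text.encode("ascii", "ignore").decode("ascii")

print(repr(ascii_only))  # → ''
```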

Harry Stuart
  • 4
    What is your definition of non-alphanumeric? Usually "かかわらず" is also treated as non-alphanumeric. – ymonad Jul 18 '19 at 02:53
  • 1
    My intuition was that alphanumeric is any character in a given language's alphabet, so for my case, an alphanumeric character is any character that is a digit or in the Japanese alphabet. – Harry Stuart Jul 18 '19 at 02:54
  • http://www.fileformat.info/info/unicode/char/00a0/index.htm is a spacing character, maybe you actually just want to normalize any spacing characters to regular spaces? – tripleee Jul 18 '19 at 02:54
  • Perhaps you are looking for something like https://stackoverflow.com/a/38617492/874188 – tripleee Jul 18 '19 at 02:57
  • 2
    Using `re.sub` to replace the `\W` (non-word) pattern with an empty string should work, e.g. `re.sub(r'\W', '', "\x0aかか\x0aわらず")`. – metatoaster Jul 18 '19 at 02:59
  • Is the target language only Japanese? How about kanji? Unicode has lots of [block](https://en.wikipedia.org/wiki/Unicode_block)s and [script](https://en.wikipedia.org/wiki/Script_(Unicode))s, so you can choose the closest block (or script) and find a regex library that supports it. – ymonad Jul 18 '19 at 03:01
  • What encoding are you using? – Obmerk Kronen Jul 18 '19 at 03:03
  • @metatoaster, your approach worked for me. Out of curiosity, how does regex know that `"。"` is a non-word character? Has it been configured for universality across languages? Similarly, how does it know that `"ず"` is a word character? – Harry Stuart Jul 18 '19 at 04:01
  • Every Unicode code point (e.g. `a`, `あ`, `甲` or `。`) has specific attributes defined in the Unicode standard, and code points with similar attributes are usually grouped together within a contiguous range. Python does not have explicit documentation on what `\w` actually means in terms of Unicode categories as defined by [Unicode CLDR](http://cldr.unicode.org/index). For a better, more well-defined regex library that actually supports the defined syntax/category labels, [this answer](https://stackoverflow.com/a/36188204/) provides additional info. – metatoaster Jul 18 '19 at 04:31
  • Relatedly, the .NET documentation for regex character classes is [much better](https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#word-character-w) than the Python stdlib documentation at explaining what the `\w` set expands to, which in .NET is [`[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Mn}\p{Nd}\p{Pc}]`](https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5Cp%7BLl%7D%5Cp%7BLu%7D%5Cp%7BLt%7D%5Cp%7BLo%7D%5Cp%7BLm%7D%5Cp%7BMn%7D%5Cp%7BNd%7D%5Cp%7BPc%7D%5D&g=&i=). Using `\W` (capitalised) negates that set. – metatoaster Jul 18 '19 at 04:37
  • Interesting! I didn't realise Unicode characters had such specifically defined attributes. I guess that's why Unicode is of such utility. If you want to formalise your comment into an answer, I'll select it as correct. – Harry Stuart Jul 18 '19 at 04:42
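The per-character attributes discussed in the comments above can be inspected with Python's standard `unicodedata` module. This is a small sketch rather than a description of how `re` is implemented internally. The helper name `describe` is made up for illustration, and pairing the category with `str.isalnum()` is an assumption based on my reading of the current `re` documentation, which describes `\w` for `str` patterns as the alphanumeric characters plus the underscore.

```python
import unicodedata

def describe(ch):
    """Return (general category, str.isalnum() result) for one character."""
    return unicodedata.category(ch), ch.isalnum()

# Sample characters taken from the comments, plus the \xa0 from the question.
for ch in ["a", "あ", "甲", "。", "\xa0"]:
    category, alnum = describe(ch)
    print(f"U+{ord(ch):04X} category={category} isalnum={alnum}")
```

The letters report a letter category (`Ll` or `Lo`) and count as alphanumeric, while `。` (`Po`, punctuation) and `\xa0` (`Zs`, space separator) do not, which matches what `\W` removes.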

1 Answer


Using `re.sub` to replace the `\W` (non-word) pattern with an empty string should work, e.g.

re.sub(r'\W', '', "\x0aかか\x0aわらず")

– metatoaster Jul 18 '19 at 2:59

This works. Since metatoaster posted it only as a comment, and not everybody reads comments, I have written it up as an actual answer.
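For completeness, here is a self-contained sketch of the same idea applied to the string from the question. The wrapper name `strip_non_word` is only for illustration. In Python 3, `str` patterns are Unicode-aware by default, so no extra flag is needed.

```python
import re

def strip_non_word(text):
    """Remove every character matched by the Unicode-aware \\W class."""
    return re.sub(r"\W", "", text)

print(strip_non_word("\xa0かかわらず"))        # the string from the question
print(strip_non_word("\x0aかか\x0aわらず。"))  # newlines and 。 are removed too
```

Note that `\w` also matches the underscore, so `_` is kept, and ordinary spaces between words are removed along with the `\xa0`.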

Walchy