Removing non-alphanumeric unicode characters from a string in Python

Question

How do I convert this string:

"\xa0かかわらず"

to this string?:

"かかわらず"

i.e. how do I remove non-alphanumeric Unicode characters? I've tried the solution that encodes the string as ASCII, but it doesn't work for Japanese characters.

What is your definition of non-alphanumeric? Usually "かかわらず" is also treated as non-alphanumeric. — ymonad, Jul 18 '19 at 02:53
My intuition was that alphanumeric is any character in a given language's alphabet, so for my case, an alphanumeric character is any character that is a digit or in the Japanese alphabet. — Harry Stuart, Jul 18 '19 at 02:54
`\xa0` (http://www.fileformat.info/info/unicode/char/00a0/index.htm) is a spacing character (NO-BREAK SPACE); maybe you actually just want to normalize any spacing characters to regular spaces? — tripleee, Jul 18 '19 at 02:54
Perhaps you are looking for something like https://stackoverflow.com/a/38617492/874188 — tripleee, Jul 18 '19 at 02:57
Using `re.sub` to replace the `\W` (non-word) pattern with an empty string should work, e.g. `re.sub(r'\W', '', "\xa0かか\xa0わらず")`. — metatoaster, Jul 18 '19 at 02:59
Is the target language only Japanese? How about kanji? Unicode has lots of [block](https://en.wikipedia.org/wiki/Unicode_block)s and [script](https://en.wikipedia.org/wiki/Script_(Unicode))s, so you can choose the closest block (or script) and find a regex library that supports it. — ymonad, Jul 18 '19 at 03:01
@metatoaster, your approach worked for me. Out of curiosity, how does regex know that `"。"` is a non-word character? Has it been configured for universality across languages? Similarly, how does it know that `"ず"` is a word character? — Harry Stuart, Jul 18 '19 at 04:01
Every Unicode code point (e.g. `a`, `あ`, `甲` or `。`) has specific attributes defined in the Unicode standard, and related characters are usually grouped together within a contiguous range. Python does not have explicit documentation on what `\w` actually means in terms of Unicode categories as defined by [Unicode CLDR](http://cldr.unicode.org/index). For a better-defined regex library that actually supports the defined syntax/category labels, [this answer](https://stackoverflow.com/a/36188204/) provides additional info. — metatoaster, Jul 18 '19 at 04:31
Relatedly, the .NET documentation on regex character classes is [much better](https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#word-character-w) than the Python stdlib documentation at spelling out what the `\w` set expands to; in .NET it is [`[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Mn}\p{Nd}\p{Pc}]`](https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5Cp%7BLl%7D%5Cp%7BLu%7D%5Cp%7BLt%7D%5Cp%7BLo%7D%5Cp%7BLm%7D%5Cp%7BMn%7D%5Cp%7BNd%7D%5Cp%7BPc%7D%5D&g=&i=). Using `\W` (capitalised) negates that set. — metatoaster, Jul 18 '19 at 04:37
Interesting! I didn't realise Unicode characters had such specifically defined attributes - I guess that's why Unicode is of such utility. If you want to formalise your comment into an answer, I'll select it as correct. — Harry Stuart, Jul 18 '19 at 04:42
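
To make the point about per-character attributes concrete, here is a small sketch that looks up the Unicode general category of the characters mentioned in the comments and checks whether Python's `\w` matches each one. It uses only the standard-library `re` and `unicodedata` modules; the helper name `describe` is made up for this illustration.

```python
import re
import unicodedata

# Python 3 str patterns are Unicode-aware by default.
WORD = re.compile(r"\w")

def describe(ch: str) -> tuple[str, bool]:
    # Returns (Unicode general category, whether the word class matches ch).
    return unicodedata.category(ch), WORD.fullmatch(ch) is not None

for ch in ["a", "あ", "甲", "。", "\xa0"]:
    category, is_word = describe(ch)
    print(f"U+{ord(ch):04X} category={category} word={is_word}")
```

The letters come back in letter categories (`Ll` for `a`, `Lo` for `あ` and `甲`) and match `\w`, while `。` (`Po`, punctuation) and `\xa0` (`Zs`, space separator) do not.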

score 0 · Answer 1 · answered Apr 27 '21 at 13:05

Using `re.sub` to replace the `\W` (non-word) pattern with an empty string should work, e.g.

`re.sub(r'\W', '', "\xa0かか\xa0わらず")`

– metatoaster Jul 18 '19 at 2:59

This works. Since metatoaster only wrote it as a comment and not everybody reads them, I felt free to write this as an actual answer...
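
For completeness, a minimal runnable sketch of that approach, using the string from the question. The helper name `strip_non_word` and its `keep_underscore` parameter are my own, added only for illustration; the actual work is done by `re.sub` with `\W`.

```python
import re

def strip_non_word(text: str, keep_underscore: bool = True) -> str:
    # \W matches any character that is not a Unicode word character.
    # The underscore counts as a word character, so it needs its own
    # entry in the character class if it should be removed too.
    pattern = r"\W" if keep_underscore else r"[\W_]"
    return re.sub(pattern, "", text)

print(strip_non_word("\xa0かかわらず"))  # かかわらず
```

One caveat, as far as I can tell: for `str` patterns Python's `\w` covers the characters that `str.isalnum()` accepts plus the underscore, which does not include combining marks. If the input may be in decomposed form, normalising it first with `unicodedata.normalize("NFC", text)` is probably a good idea.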
