
Let's say my dataframe has a column that mixes English and Chinese words or characters. I would like to remove all whitespace between Chinese characters, but keep exactly one space between English words:

I have found a solution for removing extra spaces between English words here:

import re
import pandas as pd

s = pd.Series(['V e  r y calm', 'Keen and a n a l y t i c a l',
               'R a s h and careless', 'Always joyful', '你 好', '黑 石  公 司',
               'FAN     STUD1O', 'beauty face 店  铺'])

Code:

regex = re.compile('(?<![a-zA-Z]{2})(?<=[a-zA-Z]{1}) +(?=[a-zA-Z] |.$)')
s.str.replace(regex, '', regex=True)

Out:

Out[87]: 
0              Very calm
1    Keen and analytical
2      Rash and careless
3          Always joyful
4                    你 好
5               黑 石  公 司
dtype: object

But as you can see, it works for English but doesn't remove the spaces between Chinese characters. How could I get the expected result below:

Out[87]: 
0              Very calm
1    Keen and analytical
2      Rash and careless
3          Always joyful
4                    你好
5                 黑石公司
dtype: object

Reference: Remove all spaces between Chinese words with regex

ah bon
  • What about single-letter words `"a"` and `"I"`? – Bohemian Nov 16 '20 at 03:41
  • Thanks, guys. Lots of good options. – wp78de Nov 16 '20 at 03:54
  • That's a good question @Bohemian. In some rare cases, for example `X Y Z company`, only the spaces between the single letters should be removed, giving `XYZ company`, but I have no idea how to solve this issue. – ah bon Nov 16 '20 at 03:56
  • I've added new elements to the series `s`; it seems none of your solutions works, as it's a mixture of English and Chinese. Could someone help to test again? – ah bon Nov 16 '20 at 05:45

3 Answers


You could use the Chinese (well, CJK) Unicode script property \p{script=Han}, or \p{Han} for short.
However, this only works if the regex engine supports UTS#18 Unicode regular expressions. Python's built-in re module does not, but the alternative (much improved) regex module does:

import regex as re

rex = r"(?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})[ ]+(?=[a-zA-Z] |.$)|(?<=\p{Han}) +"
test_str = ("V e  r y calm\n"
    "Keen and a n a l y t i c a l\n"
    "R a s h and careless\n"
    "Always joyful\n"
    "你 好\n"
    "黑 石  公 司")
result = re.sub(rex, "", test_str, 0, re.MULTILINE | re.UNICODE)

Results in

Very calm
Keen and analytical
Rash and careless
Always joyful
你好
黑石公司

Online Demo (the demo is using PCRE for demonstration purposes only)

wp78de
  • I applied your code to one column `rent_name`, using `df['rent_name'].replace(re.compile(r"(?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})[ ]+(?=[a-zA-Z] |.$)|(?<=\p{Han}) +", re.MULTILINE | re.UNICODE))`, but it returns `error: bad escape \p`, any ideas? – ah bon Nov 16 '20 at 05:21
  • Are you using `import regex as re`? – wp78de Nov 16 '20 at 05:27
  • Yes, now it raises an error: `TypeError: replace() missing 1 required positional argument: 'repl'`. If `test_str` is one column in a dataframe, how could I use your code? – ah bon Nov 16 '20 at 05:29
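
Since pandas may not accept patterns compiled by the third-party `regex` module in `.str.replace`, a minimal sketch of applying this answer's pattern to a column is to map `regex.sub` over the values instead (the `rent_name` column and the sample rows below are purely illustrative):

import regex  # third-party regex module, supports \p{Han}
import pandas as pd

# Illustrative dataframe; 'rent_name' stands in for the real column.
df = pd.DataFrame({'rent_name': ['V e  r y calm', '黑 石  公 司', 'beauty face 店  铺']})

pattern = regex.compile(
    r"(?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})[ ]+(?=[a-zA-Z] |.$)|(?<=\p{Han}) +")

# Apply the substitution element-wise; .str.replace may reject patterns
# compiled by the regex module, so use map/apply with regex.sub instead.
df['rent_name'] = df['rent_name'].map(lambda x: pattern.sub('', x))
print(df)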

Use word boundaries \b in lookarounds:

(?<=\b\w\b) +(?=\b\w\b)

This matches spaces between solitary "word characters" (single characters bounded by word boundaries on both sides), which include Chinese characters.

Before Python 3 (and in Java, for example), \w only matches English letters by default, so you would need to add the Unicode flag (?u) to the front of the regex.


import re

s = ['V e  r y calm', 'Keen and a n a l y t i c a l',
     'R a s h and careless', 'Always joyful', '你 好', '黑 石  公 司']
regex = r'(?<=\b\w\b) +(?=\b\w\b)'
res = [re.sub(regex, '', line) for line in s]
print(res)

Output:

['Very calm', 'Keen and analytical', 'Rash and careless', 'Always joyful', '你好', '黑石公司']
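
As a sketch, the same pattern can also be applied directly to the pandas Series from the question (`regex=True` is spelled out here, as newer pandas versions require it):

import pandas as pd

s = pd.Series(['V e  r y calm', 'Keen and a n a l y t i c a l',
               'R a s h and careless', 'Always joyful', '你 好', '黑 石  公 司'])

# Drop runs of spaces that sit between two solitary word characters,
# which covers both single Latin letters and CJK characters.
print(s.str.replace(r'(?<=\b\w\b) +(?=\b\w\b)', '', regex=True))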
Bohemian

This regex should get you what you want. See the full code snippet at the bottom.

regex = re.compile(
    r"((?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})\s+(?=[a-zA-Z]\s|.$)|(?<=[\u4e00-\u9fff]{1})\s+)",
    re.UNICODE,
)

I made the following edits to your regex. Right now, your regex basically matches every space that appears after a single-letter word and before another single-letter word.

  1. I added a part at the end of the regex that selects all spaces after a Chinese character (I used the Unicode range [\u4e00-\u9fff], which also covers the CJK ideographs used in Japanese and Korean).
  2. I changed the spaces in the regex to the whitespace character class \s so we could catch other input like tabs.
  3. I also added the re.UNICODE flag so that \s would cover Unicode spaces as well.

import re
import pandas as pd

s = pd.Series(
    [
        "V e  r y calm",
        "Keen and a n a l y t i c a l",
        "R a s h and careless",
        "Always joyful",
        "你 好",
        "黑 石  公 司",
    ]
)

regex = re.compile(
    r"((?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})\s+(?=[a-zA-Z]\s|.$)|(?<=[\u4e00-\u9fff]{1})\s+)",
    re.UNICODE,
)
s.str.replace(regex, "", regex=True)

Output:

0              Very calm
1    Keen and analytical
2      Rash and careless
3          Always joyful
4                     你好
5                   黑石公司
dtype: object
Roger Z
  • Sorry, your solution doesn't seem to work for `FAN STUD1O`; maybe we need `df['Day'].str.capitalize()` and then apply your code? – ah bon Nov 16 '20 at 06:08
  • I would handle that case with a simpler regex that first collapses all whitespace in a string: `s.str.replace(re.compile("\s+", re.UNICODE), " ")` – Roger Z Nov 16 '20 at 15:54
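
For the updated series with `FAN     STUD1O` and `beauty face 店  铺`, a sketch of that two-step idea (collapse whitespace first, then apply the regex from this answer) might look like this; `regex=True` is assumed for newer pandas versions:

import re
import pandas as pd

s = pd.Series(['V e  r y calm', 'Keen and a n a l y t i c a l',
               'R a s h and careless', 'Always joyful', '你 好', '黑 石  公 司',
               'FAN     STUD1O', 'beauty face 店  铺'])

# Step 1: collapse every whitespace run to a single space.
s = s.str.replace(r'\s+', ' ', regex=True)

# Step 2: remove the remaining spaces after single letters and after
# Chinese characters, using the regex from the answer above.
pattern = re.compile(
    r"((?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})\s+(?=[a-zA-Z]\s|.$)|(?<=[\u4e00-\u9fff]{1})\s+)",
    re.UNICODE,
)
print(s.str.replace(pattern, '', regex=True))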