
Let's say my dataframe has a column that mixes English and Chinese words or characters. I would like to remove all whitespace between Chinese characters, but keep exactly one space between English words:

I have found a solution for removing extra spaces between English words here:

import re
import pandas as pd

s = pd.Series(['V e  r y calm', 'Keen and a n a l y t i c a l',
               'R a s h and careless', 'Always joyful', '你 好', '黑 石  公 司',
               'FAN     STUD1O', 'beauty face 店  铺'])

Code:

regex = re.compile('(?<![a-zA-Z]{2})(?<=[a-zA-Z]{1}) +(?=[a-zA-Z] |.$)')
s.str.replace(regex, '', regex=True)

Out:

Out[87]: 
0              Very calm
1    Keen and analytical
2      Rash and careless
3          Always joyful
4                    你 好
5               黑 石  公 司
dtype: object

But as you can see, it works for English but doesn't remove the spaces between Chinese characters. How could I get the expected result below:

Out[87]: 
0              Very calm
1    Keen and analytical
2      Rash and careless
3          Always joyful
4                    你好
5                 黑石公司
dtype: object

Reference: Remove all spaces between Chinese words with regex

ah bon
  • What about single-letter words `"a"` and `"I"`? – Bohemian Nov 16 '20 at 03:41
  • Thanks, guys. Lots of good options. – wp78de Nov 16 '20 at 03:54
  • That's a good question @Bohemian. In some rare cases, for example `X Y Z company`, only the spaces between the single letters should be removed, giving `XYZ company`, but I have no idea how to solve this issue. – ah bon Nov 16 '20 at 03:56
  • I've added new elements to the series `s`; it seems none of your solutions works, as it's a mixture of English and Chinese. Could someone help to test again? – ah bon Nov 16 '20 at 05:45

3 Answers


You could use the Chinese (well, CJK) Unicode script property \p{script=Han}, or \p{Han} for short.
However, this only works if the regex engine supports UTS#18 Unicode regular expressions. Python's built-in re module does not, but the alternative (much improved) regex module does:

import regex as re

rex = r"(?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})[ ]+(?=[a-zA-Z] |.$)|(?<=\p{Han}) +"
test_str = ("V e  r y calm\n"
    "Keen and a n a l y t i c a l\n"
    "R a s h and careless\n"
    "Always joyful\n"
    "你 好\n"
    "黑 石  公 司")
result = re.sub(rex, "", test_str, 0, re.MULTILINE | re.UNICODE)

Results in

Very calm
Keen and analytical
Rash and careless
Always joyful
你好
黑石公司

Online Demo (the demo is using PCRE for demonstration purposes only)

wp78de
  • I applied your code to one column `rent_name`, using `df['rent_name'].replace(re.compile(r"(?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})[ ]+(?=[a-zA-Z] |.$)|(?<=\p{Han}) +", re.MULTILINE | re.UNICODE))`, but it returns `error: bad escape \p`, any ideas? – ah bon Nov 16 '20 at 05:21
  • Are you using `import regex as re`? – wp78de Nov 16 '20 at 05:27
  • Yes, now it raises an error: `TypeError: replace() missing 1 required positional argument: 'repl'`. If `test_str` is one column in a dataframe, how could I use your code? – ah bon Nov 16 '20 at 05:29
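
Since pandas may not accept patterns compiled by the third-party `regex` module in `.str.replace`, a minimal sketch of applying this answer's pattern to a column is to map `regex.sub` over the values instead (the `rent_name` column and the sample rows below are purely illustrative):

import regex  # third-party regex module, supports \p{Han}
import pandas as pd

# Illustrative dataframe; 'rent_name' stands in for the real column.
df = pd.DataFrame({'rent_name': ['V e  r y calm', '黑 石  公 司', 'beauty face 店  铺']})

pattern = regex.compile(
    r"(?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})[ ]+(?=[a-zA-Z] |.$)|(?<=\p{Han}) +")

# Apply the substitution element-wise; .str.replace may reject patterns
# compiled by the regex module, so use map/apply with regex.sub instead.
df['rent_name'] = df['rent_name'].map(lambda x: pattern.sub('', x))
print(df)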

Use word boundaries \b in lookarounds:

(?<=\b\w\b) +(?=\b\w\b)

This matches spaces between solitary "word characters" (single characters bounded by word boundaries on both sides), which include Chinese characters.

Before Python 3 (and in Java, for example), \w only matches English letters by default, so you would need to add the Unicode flag (?u) to the front of the regex.


import re

s = ['V e  r y calm', 'Keen and a n a l y t i c a l',
     'R a s h and careless', 'Always joyful', '你 好', '黑 石  公 司']
regex = r'(?<=\b\w\b) +(?=\b\w\b)'
res = [re.sub(regex, '', line) for line in s]
print(res)

Output:

['Very calm', 'Keen and analytical', 'Rash and careless', 'Always joyful', '你好', '黑石公司']
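
As a sketch, the same pattern can also be applied directly to the pandas Series from the question (`regex=True` is spelled out here, as newer pandas versions require it):

import pandas as pd

s = pd.Series(['V e  r y calm', 'Keen and a n a l y t i c a l',
               'R a s h and careless', 'Always joyful', '你 好', '黑 石  公 司'])

# Drop runs of spaces that sit between two solitary word characters,
# which covers both single Latin letters and CJK characters.
print(s.str.replace(r'(?<=\b\w\b) +(?=\b\w\b)', '', regex=True))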
Bohemian

This regex should get you what you want. See the full code snippet at the bottom.

regex = re.compile(
    r"((?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})\s+(?=[a-zA-Z]\s|.$)|(?<=[\u4e00-\u9fff]{1})\s+)",
    re.UNICODE,
)

I made the following edits to your regex. Right now, your regex basically matches every space that appears after a single-letter word and before another single-letter word.

  1. I added a part at the end of the regex that selects all spaces after a Chinese character (I used the Unicode range [\u4e00-\u9fff], which also covers the CJK ideographs used in Japanese and Korean).
  2. I changed the spaces in the regex to the whitespace character class \s so we could catch other input like tabs.
  3. I also added the re.UNICODE flag so that \s would cover Unicode spaces as well.

import re
import pandas as pd

s = pd.Series(
    [
        "V e  r y calm",
        "Keen and a n a l y t i c a l",
        "R a s h and careless",
        "Always joyful",
        "你 好",
        "黑 石  公 司",
    ]
)

regex = re.compile(
    r"((?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})\s+(?=[a-zA-Z]\s|.$)|(?<=[\u4e00-\u9fff]{1})\s+)",
    re.UNICODE,
)
s.str.replace(regex, "", regex=True)

Output:

0              Very calm
1    Keen and analytical
2      Rash and careless
3          Always joyful
4                     你好
5                   黑石公司
dtype: object
Roger Z
  • Sorry, your solution doesn't seem to work for `FAN STUD1O`; maybe we need `df['Day'].str.capitalize()` and then apply your code? – ah bon Nov 16 '20 at 06:08
  • I would handle that case with a simpler regex that first collapses all whitespace in a string: `s.str.replace(re.compile("\s+", re.UNICODE), " ")` – Roger Z Nov 16 '20 at 15:54
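
For the updated series with `FAN     STUD1O` and `beauty face 店  铺`, a sketch of that two-step idea (collapse whitespace first, then apply the regex from this answer) might look like this; `regex=True` is assumed for newer pandas versions:

import re
import pandas as pd

s = pd.Series(['V e  r y calm', 'Keen and a n a l y t i c a l',
               'R a s h and careless', 'Always joyful', '你 好', '黑 石  公 司',
               'FAN     STUD1O', 'beauty face 店  铺'])

# Step 1: collapse every whitespace run to a single space.
s = s.str.replace(r'\s+', ' ', regex=True)

# Step 2: remove the remaining spaces after single letters and after
# Chinese characters, using the regex from the answer above.
pattern = re.compile(
    r"((?<![a-zA-Z]{2})(?<=[a-zA-Z]{1})\s+(?=[a-zA-Z]\s|.$)|(?<=[\u4e00-\u9fff]{1})\s+)",
    re.UNICODE,
)
print(s.str.replace(pattern, '', regex=True))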