Remove spaces and punctuation from Chinese string column in Python

Question

To drop duplicates from the following dataframe by the news column, I am trying to remove all spaces and punctuation from that column.

      date                             news
0  2017-08      北京写字楼租金哪家高? 金融街、CBD、亚奥居TOP3
1  2017-08       租金一直涨,到底是谁租走了北京最贵的写字楼(附名单)
2  2017-09                 北京三季度写字楼租金继续保持平稳
3  2017-09           戴德梁行:第三季度北京写字楼市场租金保持平稳
4  2018-01  北京豪华公寓销量大涨76.5% 金融街写字楼租金创35季度新高
5  2010-11             楼市下行,高租金的商住和写字楼能不能投?

I have tried the following solutions:

df.news = df.news.apply(lambda x: re.sub(r'[^\w\s]', '', x)).replace(' ', '')
df.news = df.news.str.replace('[^\w\s]', '').str.strip()

Both produce output that still has spaces inside the strings:

0         北京写字楼租金哪家高 金融街CBD亚奥居TOP3        ---> space in the phrase
1          租金一直涨到底是谁租走了北京最贵的写字楼附名单
2                 北京三季度写字楼租金继续保持平稳
3            戴德梁行第三季度北京写字楼市场租金保持平稳
4    北京豪华公寓销量大涨765 金融街写字楼租金创35季度新高  ---> space in the phrase
5               楼市下行高租金的商住和写字楼能不能投

The following code keeps only the first part of each news phrase and drops the rest:

df.news = df.news.str.extract('(\w+)', expand = False)

0          北京写字楼租金哪家高
1               租金一直涨
2    北京三季度写字楼租金继续保持平稳
3                戴德梁行
4        北京豪华公寓销量大涨76
5                楼市下行

How can I get the following expected result for the news column? Thank you.

0         北京写字楼租金哪家高金融街CBD亚奥居TOP3        
1          租金一直涨到底是谁租走了北京最贵的写字楼附名单
2                 北京三季度写字楼租金继续保持平稳
3            戴德梁行第三季度北京写字楼市场租金保持平稳
4    北京豪华公寓销量大涨765金融街写字楼租金创35季度新高  
5               楼市下行高租金的商住和写字楼能不能投

Looks like you want `df['news'] = df['news'].str.replace(r'[\W_]+', '')` — Wiktor Stribiżew, Jan 08 '20 at 08:55
`[^\w\s]` can't match whitespace chars as it is a negated character class matching any char but a word (letter, digit, `_` + some diacritics etc.) and whitespace chars. If you remove `\s`, it will be equal to `\W` that does not match `_`, thus `[\W_]` is what you need to only keep all alphanumeric chars. — Wiktor Stribiżew, Jan 08 '20 at 09:05
Thank you. Do you have any tutorials on this issue to recommend? — ah bon, Jan 08 '20 at 09:19
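
The `[\W_]+` pattern suggested in the comments above can be tried on a few rows from the question. This is only a minimal sketch: the three-row frame and the name `cleaned` are made up for illustration, and `regex=True` is passed explicitly on the assumption that a recent pandas version is in use, where `str.replace` treats the pattern as a literal string by default.

```python
import pandas as pd

# Small sample taken from the question's data (illustrative subset only)
df = pd.DataFrame({
    "date": ["2017-08", "2017-09", "2018-01"],
    "news": [
        "北京写字楼租金哪家高? 金融街、CBD、亚奥居TOP3",
        "戴德梁行:第三季度北京写字楼市场租金保持平稳",
        "北京豪华公寓销量大涨76.5% 金融街写字楼租金创35季度新高",
    ],
})

# [\W_]+ matches runs of characters that are neither letters nor digits,
# which covers whitespace, ASCII and CJK punctuation, and the underscore.
# regex=True is spelled out so the pattern is not taken literally.
cleaned = df["news"].str.replace(r"[\W_]+", "", regex=True)

print(cleaned.tolist())
```

Because `\w` is Unicode-aware for `str` patterns in Python 3, the Chinese characters, Latin letters and digits are kept, while spaces and punctuation are removed in a single pass.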

score -1 · Accepted Answer · answered Jan 08 '20 at 08:57

This seems to work:

 df.news.apply(lambda x: re.sub(r'[^\w\s]', '', x)).str.replace(' ', '')

Output:

0         北京写字楼租金哪家高金融街CBD亚奥居TOP3
1         租金一直涨到底是谁租走了北京最贵的写字楼附名单
2                北京三季度写字楼租金继续保持平稳
3           戴德梁行第三季度北京写字楼市场租金保持平稳
4    北京豪华公寓销量大涨765金融街写字楼租金创35季度新高
5              楼市下行高租金的商住和写字楼能不能投
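
A possible explanation for why this works while the first attempt in the question did not: `Series.replace(' ', '')` compares against whole cell values, whereas `Series.str.replace(' ', '')` replaces substrings inside each string. The sketch below illustrates the difference and then uses a cleaned key to drop duplicates, which was the stated goal. The sample values and the names `news`, `key` and `deduped` are hypothetical.

```python
import pandas as pd

news = pd.Series(["北京 写字楼", "北京写字楼", " "])

# Series.replace matches entire cell values by default,
# so only the cell that is exactly " " is changed.
whole_value = news.replace(" ", "")

# Series.str.replace operates on substrings within each string.
substring = news.str.replace(" ", "", regex=False)

print(whole_value.tolist())
print(substring.tolist())

# Dropping duplicates on a cleaned key while keeping the original text
df = pd.DataFrame({"news": ["北京 写字楼租金", "北京写字楼租金!", "楼市下行"]})
key = df["news"].str.replace(r"[\W_]+", "", regex=True)
deduped = df[~key.duplicated()]

print(deduped["news"].tolist())
```

Keeping the cleaned strings in a separate key rather than overwriting the column is a design choice; it leaves the original headlines readable after deduplication.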
