How to filter first occurrence of Mandarin characters from a column in pandas and put that in another column

Question

I have a dataframe df:

import pandas as pd
df = pd.DataFrame({"ID": [1,2,3,4,5],
           "eng_mand" :["後山 4.7·3 reviews Community Center 竹杉園休閒農場",
                        "Taipei City 台北市Taiwan",
                        "綠山谷海芋園餐廳 3.8·52 reviews",
                        "名陽匍休閒農莊minyangpu大賞園",
                        "Menghuanhu"]})

It looks like:

   ID                                   eng_mand
0   1  後山 4.7·3 reviews Community Center 竹杉園休閒農場
1   2                      Taipei City 台北市Taiwan
2   3                    綠山谷海芋園餐廳 3.8·52 reviews
3   4                        名陽匍休閒農莊minyangpu大賞園
4   5                                 Menghuanhu

I want to extract the first occurrence of Mandarin characters from the column eng_mand and put it in another column, mandarin_char. My final output should look like:

   ID                                   eng_mand             mandarin_char          
0   1  後山 4.7·3 reviews Community Center 竹杉園休閒農場        後山
1   2                      Taipei City 台北市Taiwan             台北市
2   3                    綠山谷海芋園餐廳 3.8·52 reviews         綠山谷海芋園餐廳
3   4                        名陽匍休閒農莊minyangpu大賞園       名陽匍休閒農莊
4   5                                 Menghuanhu

How can I do this in Python with pandas?

score 1 · Accepted Answer · answered Aug 01 '18 at 09:08

Use str.extract to capture the first run of Chinese characters, and add fillna('') to replace NaNs with empty strings if necessary:

df['mandarin_char'] = df['eng_mand'].str.extract(r'([\u4e00-\u9fff]+)').fillna('')
print(df)
   ID                                   eng_mand mandarin_char
0   1  後山 4.7·3 reviews Community Center 竹杉園休閒農場            後山
1   2                      Taipei City 台北市Taiwan           台北市
2   3                    綠山谷海芋園餐廳 3.8·52 reviews      綠山谷海芋園餐廳
3   4                        名陽匍休閒農莊minyangpu大賞園       名陽匍休閒農莊
4   5                                 Menghuanhu
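
To see why this returns only the first run, here is a minimal sketch of the same pattern using the standard re module outside pandas. The names CJK_RUN and first_cjk_run are hypothetical helpers for illustration, not part of pandas. The range \u4e00-\u9fff is the CJK Unified Ideographs block, so characters in the extension blocks would not match.

```python
import re

# \u4e00-\u9fff is the CJK Unified Ideographs block; "+" matches a run of them.
# CJK_RUN and first_cjk_run are illustrative names, not from the original answer.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def first_cjk_run(text):
    """Return the first run of CJK characters in text, or '' if there is none."""
    match = CJK_RUN.search(text)  # search stops at the first match
    return match.group(0) if match else ""


samples = [
    "後山 4.7·3 reviews Community Center 竹杉園休閒農場",
    "Taipei City 台北市Taiwan",
    "名陽匍休閒農莊minyangpu大賞園",
    "Menghuanhu",
]
results = [first_cjk_run(s) for s in samples]
print(results)
```

str.extract behaves like re.search applied per row: later runs such as 竹杉園休閒農場 are ignored, and rows with no match yield NaN, which fillna('') turns into an empty string.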

EdChum · Answer 2 · 2018-08-01T09:15:12.707

Use str.findall with the regex for the Chinese character range (\u4e00-\u9fff) and take the first match with str[0]:

In[14]:
df['mandarin_char'] = df['eng_mand'].str.findall('[\u4e00-\u9fff]+').str[0]
df

Out[14]: 
   ID                                   eng_mand mandarin_char
0   1  後山 4.7·3 reviews Community Center 竹杉園休閒農場            後山
1   2                      Taipei City 台北市Taiwan           台北市
2   3                    綠山谷海芋園餐廳 3.8·52 reviews      綠山谷海芋園餐廳
3   4                        名陽匍休閒農莊minyangpu大賞園       名陽匍休閒農莊
4   5                                 Menghuanhu           NaN

You can call fillna('') on the result to replace NaN if required.
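
As a rough sketch of the two approaches side by side, assuming a recent pandas version, the following applies both to a shortened copy of the question's data. The variable names pattern, first_run and via_extract are made up for this example. Passing expand=False to str.extract is a choice made here so that a Series comes back rather than a one-column DataFrame.

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only.
df = pd.DataFrame({
    "ID": [1, 2, 5],
    "eng_mand": [
        "後山 4.7·3 reviews Community Center 竹杉園休閒農場",
        "Taipei City 台北市Taiwan",
        "Menghuanhu",
    ],
})

pattern = r"[\u4e00-\u9fff]+"

# findall collects every run per row as a list; .str[0] picks the first one
# and gives NaN for an empty list, which fillna('') replaces.
first_run = df["eng_mand"].str.findall(pattern).str[0].fillna("")

# extract keeps only the first match of the capture group;
# expand=False returns a Series instead of a one-column DataFrame.
via_extract = df["eng_mand"].str.extract(f"({pattern})", expand=False).fillna("")

df["mandarin_char"] = first_run
print(df)
```

Both give the same column here. findall does more work because it builds the full list of matches per row, so extract may be the lighter option when only the first run is needed.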
