Is there a way, in a pandas DataFrame, to extract from a column of strings only the words that are exactly 2 characters long?
For example:
Singapore SG Jalan ID Indonesia Malaysia MY
And the result should be:
SG ID MY
Use str.findall with a regex, followed by str.join:
df['B'] = df['A'].str.findall(r'\b[a-zA-Z]{2}\b').str.join(' ')
print(df)
                                             A         B
0  Singapore SG Jalan ID Indonesia Malaysia MY  SG ID MY
1                           Singapore SG Jalan        SG
2                        Singapore Malaysia MY        MY
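For anyone who wants to run this end to end, here is a minimal self-contained sketch. The three sample rows are reconstructed from the output shown in this answer, and the `result` variable is only there for illustration.

```python
import pandas as pd

# Sample rows reconstructed from the output above; column name 'A' as in the answer.
df = pd.DataFrame({'A': ['Singapore SG Jalan ID Indonesia Malaysia MY',
                         'Singapore SG Jalan',
                         'Singapore Malaysia MY']})

# \b is a word boundary, so only standalone two-letter tokens match;
# findall returns a list per row, and str.join turns each list into one string.
df['B'] = df['A'].str.findall(r'\b[a-zA-Z]{2}\b').str.join(' ')

result = df['B'].tolist()
print(result)
```

Note that `[a-zA-Z]` restricts matches to ASCII letters, so two-digit numbers and accented letters would not be picked up.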
A non-regex alternative is to split each string on whitespace and keep the tokens of length 2:
df["short"] = df["test"].apply(lambda x: " ".join([i for i in x.split() if len(i) == 2]))
Output:
                                          test     short
0  Singapore SG Jalan ID Indonesia Malaysia MY  SG ID MY
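One thing to be aware of with the split approach: punctuation attached to a word counts toward its length. The sketch below compares it with the regex from the earlier answer on a made-up string (the input text is hypothetical, not from the question), using plain Python so the difference is easy to see.

```python
import re

# Hypothetical input with punctuation attached to some of the codes.
text = "Singapore (SG), Jalan ID. Malaysia MY"

# Whitespace split: "(SG)," and "ID." are longer than 2 characters, so they are dropped.
by_split = [w for w in text.split() if len(w) == 2]

# Word-boundary regex: punctuation is not part of the match.
by_regex = re.findall(r'\b[a-zA-Z]{2}\b', text)

print(by_split)
print(by_regex)
```

If the column only ever contains space-separated words, both approaches give the same result and the split version is arguably easier to read.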
You can use this:
import pandas as pd
data = {'a': ['Singapore SG Jalan ID', 'SG Jalan ID Indonesia Malaysia MY']}
df = pd.DataFrame(data=data)
print(df)
                                   a
0              Singapore SG Jalan ID
1  SG Jalan ID Indonesia Malaysia MY
df['a1'] = df['a'].str.findall(r'\b\S\S\b')
Output:
                                   a            a1
0              Singapore SG Jalan ID      [SG, ID]
1  SG Jalan ID Indonesia Malaysia MY  [SG, ID, MY]
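Two details about this pattern are worth knowing. The new column holds Python lists rather than strings, and `\S` matches any non-whitespace character, so two-digit numbers match as well. A small sketch, assuming a made-up second row to show the digit case:

```python
import pandas as pd

# First row is from the answer above; the second row is a hypothetical example.
s = pd.Series(['Singapore SG Jalan ID', 'Block 42 MY'])

# Each element of `lists` is a Python list of matches.
lists = s.str.findall(r'\b\S\S\b')

# str.join collapses each list into a single space-separated string.
joined = lists.str.join(' ')

print(lists.tolist())
print(joined.tolist())
```

If only letters should be kept, swapping `\S\S` for `[a-zA-Z]{2}` is one option.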
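Using pd.Series.str.replace to remove every run of words that are 3 or more characters long:
# regex=True is needed on pandas >= 2.0, where str.replace defaults to regex=False
df.assign(B=df.A.str.replace(r'(\s*\w{3,}\s*)+', ' ', regex=True).str.strip())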
                                             A         B
0  Singapore SG Jalan ID Indonesia Malaysia MY  SG ID MY
1                           Singapore SG Jalan        SG
2                        Singapore Malaysia MY        MY
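Because this approach deletes long words instead of selecting short ones, anything shorter than 3 characters is left in place, including single-character tokens. A quick sketch of that edge case, where the last row is a hypothetical input and not from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': ['Singapore SG Jalan ID Indonesia Malaysia MY',
                         'Singapore SG Jalan',
                         'a Singapore SG']})  # last row: hypothetical edge case

# Replace each run of 3+ character words (plus surrounding spaces) with one space.
pattern = r'(\s*\w{3,}\s*)+'
df['B'] = df['A'].str.replace(pattern, ' ', regex=True).str.strip()

result = df['B'].tolist()
print(result)
```

The one-letter "a" survives in the last row, so if the data can contain such tokens, the findall-based answers are probably the safer choice.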