Extract substring from text in a pandas DataFrame as new column

Question

I have a list of words I want to look for:

word_list = ['one','three']

And I have a text column in a pandas DataFrame:

TEXT                                       |
-------------------------------------------|
"Perhaps she'll be the one for me."        |
"Is it two or one?"                        |
"Mayhaps it be three afterall..."          |
"Three times and it's a charm."            |
"One fish, two fish, red fish, blue fish." |
"There's only one cat in the hat."         |
"One does not simply code into pandas."    |
"Two nights later..."                      |
"Quoth the Raven... nevermore."            |

The desired output is shown below: it keeps the original text column and extracts the matching word from word_list into a new column.

TEXT                                       | EXTRACT
-------------------------------------------|---------------
"Perhaps she'll be the one for me."        | one
"Is it two or one?"                        | one
"Mayhaps it be three afterall..."          | three
"Three times and it's a charm."            | three
"One fish, two fish, red fish, blue fish." | one
"There's only one cat in the hat."         | one
"One does not simply code into pandas."    | one
"Two nights later..."                      | 
"Quoth the Raven... nevermore."            |

Is there a way to do this in Python 2.7?

score 10 · Accepted Answer · answered Oct 24 '17 at 23:23


Use str.extract:

import re

df['EXTRACT'] = df.TEXT.str.extract('({})'.format('|'.join(word_list)), 
                        flags=re.IGNORECASE, expand=False).str.lower().fillna('')
df['EXTRACT']

0      one
1      one
2    three
3    three
4      one
5      one
6      one
7         
8         
Name: EXTRACT, dtype: object

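The words in word_list are joined with the regex alternation operator | and wrapped in a capture group, giving the pattern (one|three), which is then passed to str.extract. With expand=False and a single capture group, str.extract returns a Series holding the first match in each row, or NaN where nothing matches.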

The re.IGNORECASE flag makes the match case-insensitive, the matches are lowercased to agree with your expected output, and fillna('') turns the rows without a match into empty strings.
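For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach, rebuilding the DataFrame from the question. The re.escape call is my addition, not part of the answer above; it only matters if a word in word_list contains regex metacharacters.

```python
import re

import pandas as pd

word_list = ['one', 'three']
df = pd.DataFrame({'TEXT': [
    "Perhaps she'll be the one for me.",
    "Is it two or one?",
    "Mayhaps it be three afterall...",
    "Three times and it's a charm.",
    "One fish, two fish, red fish, blue fish.",
    "There's only one cat in the hat.",
    "One does not simply code into pandas.",
    "Two nights later...",
    "Quoth the Raven... nevermore.",
]})

# Build the pattern '(one|three)'; re.escape (an addition) protects
# against words that contain regex metacharacters such as '.' or '+'.
pattern = '({})'.format('|'.join(re.escape(w) for w in word_list))

df['EXTRACT'] = (
    df['TEXT']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # first match or NaN
    .str.lower()                                              # 'One' -> 'one'
    .fillna('')                                               # NaN -> ''
)

print(df['EXTRACT'].tolist())
```

Note that this matches substrings, so a word such as "someone" would also yield "one". If whole-word matching is needed, wrapping the group in \b anchors (for example r'\b(one|three)\b') is one option.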


cs95


How about extracting more than one word from the `word_list`? – Gursel Karacor Jan 18 '20 at 17:05

@GurselKaracor you can look into findall or extractall. – cs95 Jan 18 '20 at 19:08
```extracted = df['TEXT'].str.findall('(' + '|'.join(word_list) + ')', flags=re.IGNORECASE); df['EXTRACT'] = extracted.str.join(',')``` – Gursel Karacor Jan 18 '20 at 21:03
I get the following warning: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy" – Z.LI May 17 '21 at 15:09
@Z.LI you’d only encounter that warning if you created df in a certain way. See my post on the topic for a clearer understanding: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas/53954986#53954986 – cs95 May 18 '21 at 06:18
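On the earlier comment about extracting more than one word per row: a minimal sketch of the findall route suggested above, using a small made-up DataFrame (the three sample sentences are illustrative, not from the question). As before, re.escape is my own addition.

```python
import re

import pandas as pd

word_list = ['one', 'three']
# Hypothetical sample data, chosen so that one row has two matches.
df = pd.DataFrame({'TEXT': [
    "One, two, three.",
    "Is it two or one?",
    "Two nights later...",
]})

pattern = '|'.join(re.escape(w) for w in word_list)

# findall returns a list of every match per row (empty list if none).
matches = df['TEXT'].str.findall(pattern, flags=re.IGNORECASE)

# Lowercase each match and join them into one comma-separated string.
df['EXTRACT'] = matches.apply(lambda found: ','.join(m.lower() for m in found))

print(df['EXTRACT'].tolist())
```

str.extractall is the other option mentioned; it returns one row per match with a MultiIndex, which may be more convenient if you want to aggregate the matches afterwards rather than store them as a string.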
