How to extract specific strings from string, for each row in a dataframe; and count for each item

Question

So I have a df that looks like this:

Text     
___________________________
Hello, Jim.

I knew it was Sam! Sam why?!

I have a known list of names I want to extract from each row if they appear, and append that to a new column.

I have used this code to extract the names:

t = []
df['text'].apply(lambda x: t.append([char for char in chars if char in x]))
df['characters'] = t

Which results in:

Text                         |Characters
_____________________________|____________
Hello, Jim.                  |[Jim]
                             |
John said it was Sam! Bad Sam|[John,Sam]

But as you can see, it hasn't counted both occurrences of 'Sam'. I want it to look like this:

Text                         |Characters
_____________________________|____________
Hello, Jim.                  |[Jim]
                             |
John said it was Sam! Bad Sam|[John,Sam,Sam]

Then I will be able to do a simple count for each item in the list for each row.

I'm not super familiar with lambda functions, and a lot of this doesn't feel very efficient.

Any ideas?

Edit:

I can do this to get a count of one specific name for each row:

df['char_count'] = df['text'].apply(lambda x: x.count('Sam'))

But I'm not sure how to pass in the list from the column I have generated.

I realise I changed my example text half way - please discount this — still_coding, May 14 '21 at 11:39

Mustafa Aydın · Accepted Answer · 2021-05-14T12:31:21.350

1

You can form a regex out of your list of names and let pandas find all:

names = ["Sam", "Jim"]

pattern = fr"\b({'|'.join(names)})\b"    

df["Character"] = df.Text.str.findall(pattern)

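The regex \b(Sam|Jim)\b matches either Sam or Jim, but only as standalone words thanks to the \b word boundaries (credit to @ShubhamSharma). Because findall collects every match in a row, a name that occurs twice appears twice in the result.

This gives: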

                           Text   Character
0                   Hello, Jim.       [Jim]
1  I knew it was Sam! Sam why?!  [Sam, Sam]
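
Since the question also asks for a count per name, here is a minimal end-to-end sketch building on the approach above. The `char_count` column name is borrowed from the question, and the use of `re.escape` and `collections.Counter` is my own addition rather than part of the original answer; `re.escape` only matters if a name contains regex metacharacters.

```python
import re
from collections import Counter

import pandas as pd

df = pd.DataFrame({"Text": ["Hello, Jim.", "I knew it was Sam! Sam why?!"]})
names = ["Sam", "Jim"]

# Escape each name so regex metacharacters in a name are matched literally
pattern = fr"\b({'|'.join(map(re.escape, names))})\b"

# One list of matches per row, repeats included
df["Character"] = df["Text"].str.findall(pattern)

# Turn each list into a {name: occurrences} mapping
df["char_count"] = df["Character"].apply(lambda found: dict(Counter(found)))

print(df[["Character", "char_count"]])
```

With the two example rows, the second row's `char_count` comes out as `{'Sam': 2}`.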

edited May 14 '21 at 12:31

answered May 14 '21 at 11:54

Mustafa Aydın


1

You can also add a boundary condition to the regex pattern to make sure it does not match partial words, e.g. `fr"\b({'|'.join(names)})\b"` – Shubham Sharma May 14 '21 at 12:23
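
To see what the word boundary changes, here is a small check using Python's `re` module directly. The sentence is made up for illustration and does not come from the question's data.

```python
import re

names = ["Sam", "Jim"]
text = "Samantha met Sam and Jimmy."  # hypothetical sentence for illustration

# Without \b, the names also match inside longer words
loose = re.findall("|".join(names), text)

# With \b on both sides, only standalone words match
strict = re.findall(fr"\b({'|'.join(names)})\b", text)

print(loose)   # ['Sam', 'Sam', 'Jim']
print(strict)  # ['Sam']
```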

sandertjuh · Answer 2 · 2021-05-14T11:46:08.240

0

I suppose that happens because Sam! is not the same as Sam. One solution would be to delete all special characters from the string value before comparing all words to your list of names.

df['text'] = df['text'].str.replace(r'\W', '', regex=True)
t = []
df['text'].apply(lambda x: t.append([char for char in chars if char in x]))
df['characters'] = t
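
As a quick check on the list comprehension itself, the sketch below assumes the question's `chars` list is `["John", "Sam"]` (the question never shows it). Because the comprehension loops over the names rather than over the text, each name can be emitted at most once, with or without punctuation; repeating a name `str.count` times is one way to keep duplicates.

```python
names = ["John", "Sam"]  # assumed contents of the question's `chars` list
text = "John said it was Sam! Bad Sam"

# Membership test: each name is emitted at most once
once = [name for name in names if name in text]

# Repeat each name once per occurrence in the text
repeated = [name for name in names for _ in range(text.count(name))]

print(once)      # ['John', 'Sam']
print(repeated)  # ['John', 'Sam', 'Sam']
```

Note that `str.count` counts substrings, so a name embedded in a longer word would be counted too.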

edited May 14 '21 at 11:46

answered May 14 '21 at 11:42

sandertjuh


I have tried that and it isn't the case. – still_coding May 14 '21 at 11:45
