Pandas: Find data frame string entries that contain a specific word more than once

Question

The Situation:

I have a pandas DataFrame and would like to find all the entries of a certain columnn with a string that contains a specific word more than once and create a separate dataframe with from said results.

What have I done?

So far I have managed to get it to collect all the entries that contain the specified word at least once.

The Code:


    import pandas as pd
    df = pd.DataFrame({'Year': ['2020', '2021', '2021'],
                       'Title': ['Energy calculation', 'Energy calculation with energy', 'Other calculation'])
    terms = ['energy']
    list_df = selection_df[selection_df['title'].str.contains('|'.join(terms), na=False, case=False)]

The output:

0 2020 Energy calculation
1 2021 Energy calculation with energy

Then question

I would like help in collecting only the second entry:

1 2021 Energy calculation with energy

Which contains the word energy more than once. How could I do this?

jezrael · Accepted Answer · 2021-09-17T10:08:57.240

You need test each value of list separately in Series.str.count for list of masks and then use np.logical_or.reduce:

import re

terms = ['energy']
masks = [selection_df['Title'].str.count(re.escape(x), flags=re.I).gt(1) for x in terms]
list_df = selection_df[np.logical_or.reduce(masks)]
print (list_df)
 Year                           Title
1  2021  Energy calculation with energy

Alternative solution:

terms = ['energy']
masks = [selection_df['Title'].str.count(re.escape(x), flags=re.I).gt(1) for x in terms]
list_df = selection_df[pd.concat(masks, axis=1).any(axis=1)]

score 2 · Answer 2 · answered Sep 17 '21 at 11:20

You can use a regex with a capturing group and a reference:

import re
reg = r'.*(%s).*\1' % '|'.join(terms)
# line above constructs reg = '.*(energy|other|terms).*\\1'

selection_df[selection_df['Title'].str.match(reg, flags=re.I)]

output:

   Year                           Title
1  2021  Energy calculation with energy

Andrej Kesely · Answer 3 · 2021-09-17T09:58:58.793

1

You can use .str.extractall with collections.Counter:

import re
from collections import Counter

terms = ["energy", "calculation"]

x = (
    df["Title"]
    .str.extractall("(" + "|".join(map(re.escape, terms)) + ")", flags=re.I)
    .groupby(level=0)
    .agg(lambda x: Counter(map(str.lower, x)).most_common(1)[0][1])
)
print(df[x[0] > 1])

Prints:

   Year                           Title
1  2021  Energy calculation with energy

edited Sep 17 '21 at 09:58

answered Sep 17 '21 at 09:50

Andrej Kesely

168,389
15
48
91

if multiple values in list this failed. – jezrael Sep 17 '21 at 09:50
@jezrael I've edited it, but your solution is more elegant :) – Andrej Kesely Sep 17 '21 at 09:59
ya, performance here should be problem. – jezrael Sep 17 '21 at 10:00

Pandas: Find data frame string entries that contain a specific word more than once

3 Answers3