0

The Situation:

I have a pandas DataFrame and would like to find all the entries of a certain columnn with a string that contains a specific word more than once and create a separate dataframe with from said results.

What have I done?

So far I have managed to get it to collect all the entries that contain the specified word at least once.

The Code:


    import pandas as pd
    df = pd.DataFrame({'Year': ['2020', '2021', '2021'],
                       'Title': ['Energy calculation', 'Energy calculation with energy', 'Other calculation'])
    terms = ['energy']
    list_df = selection_df[selection_df['title'].str.contains('|'.join(terms), na=False, case=False)]

The output:

0 2020 Energy calculation
1 2021 Energy calculation with energy

Then question

I would like help in collecting only the second entry:

1 2021 Energy calculation with energy

Which contains the word energy more than once. How could I do this?

Kwsswart
  • 529
  • 1
  • 7
  • 20
TeXlearner
  • 113
  • 4

3 Answers3

2

You need test each value of list separately in Series.str.count for list of masks and then use np.logical_or.reduce:

import re

terms = ['energy']
masks = [selection_df['Title'].str.count(re.escape(x), flags=re.I).gt(1) for x in terms]
list_df = selection_df[np.logical_or.reduce(masks)]
print (list_df)
 Year                           Title
1  2021  Energy calculation with energy

Alternative solution:

terms = ['energy']
masks = [selection_df['Title'].str.count(re.escape(x), flags=re.I).gt(1) for x in terms]
list_df = selection_df[pd.concat(masks, axis=1).any(axis=1)]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

You can use a regex with a capturing group and a reference:

import re
reg = r'.*(%s).*\1' % '|'.join(terms)
# line above constructs reg = '.*(energy|other|terms).*\\1'

selection_df[selection_df['Title'].str.match(reg, flags=re.I)]

output:

   Year                           Title
1  2021  Energy calculation with energy

mozway
  • 194,879
  • 13
  • 39
  • 75
1

You can use .str.extractall with collections.Counter:

import re
from collections import Counter

terms = ["energy", "calculation"]

x = (
    df["Title"]
    .str.extractall("(" + "|".join(map(re.escape, terms)) + ")", flags=re.I)
    .groupby(level=0)
    .agg(lambda x: Counter(map(str.lower, x)).most_common(1)[0][1])
)
print(df[x[0] > 1])

Prints:

   Year                           Title
1  2021  Energy calculation with energy
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91