I have a pandas DataFrame where one column ('processed') is a single string containing some pre-processed text of varying length.

I want to search using a list of keywords, of arbitrary length, to return only the processed notes for rows where the string in 'processed' contains ALL of the elements in the list.

Of course, I can search the terms individually, like:

words = ['searchterm1', 'searchterm2']
notes = df.loc[(df.processed.str.contains(words[0])) & (df.processed.str.contains(words[1]))].processed

But this seems inefficient, and would require different code depending on the number of search terms I'm using.

What I'm looking for is something like....

notes = (df.loc[[(df.processed.str.contains(words[i])) for i in range(len(words))]]).processed

Which would include

"searchterm1 foo bar searchterm2"

but NOT include

"foo bar searchterm1"

or

"searchterm2".

But this doesn't work: loc doesn't accept a list of boolean Series (or a generator object) as input.

So what is the best way to find a string that contains multiple substrings? Thanks!

user3140106
  • Possible duplicate of [Searching Multiple Strings in pandas without predefining number of strings to use](https://stackoverflow.com/questions/22623977/searching-multiple-strings-in-pandas-without-predefining-number-of-strings-to-us) – asongtoruin Sep 11 '18 at 12:09
  • Are you looking for **any** (i.e. at least one) or **all** substrings to match? – jpp Sep 11 '18 at 12:24
  • Looking for all to match. Will edit question to clarify. – user3140106 Sep 11 '18 at 12:38
  • Possible duplicate of [pandas dataframe str.contains() AND operation](https://stackoverflow.com/questions/37011734/pandas-dataframe-str-contains-and-operation) – jpp Sep 11 '18 at 13:09

1 Answer

Example data:

import pandas as pd

# One text column named 'processed', matching the output shown below
df = pd.DataFrame({'processed': ['one', 'two', 'three', 'one one', 'two, one']})

words = ['two', 'three', 'two']

Output:

  processed
0       one
1       two
2     three
3   one one
4  two, one

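I modified your code as follows. Note that the > 0 test keeps rows that contain at least one of the words (any match, not all):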

notes = df.loc[sum([df.processed.str.contains(word) for word in words]) > 0]

Output:

  processed
1       two
2     three
4  two, one
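
Since the question asks for rows that contain all of the words, the same idea can be adapted by AND-ing the per-word masks instead of testing the sum against zero. The following is only a minimal sketch using the 'processed' values from the output above. The helper name contains_all is hypothetical, and passing regex=False and na=False is my assumption that the search terms are literal substrings and that missing values should count as non-matches.

```python
from functools import reduce
import operator

import pandas as pd

df = pd.DataFrame({'processed': ['one', 'two', 'three', 'one one', 'two, one']})


def contains_all(series, words):
    """Hypothetical helper: True where the string contains every word as a literal substring."""
    # regex=False treats each word as plain text; na=False treats missing values as non-matches
    masks = [series.str.contains(word, regex=False, na=False) for word in words]
    # AND the per-word masks together; the all-True start value means an empty list matches every row
    return reduce(operator.and_, masks, pd.Series(True, index=series.index))


notes = df.loc[contains_all(df.processed, ['two', 'one']), 'processed']
print(notes)
```

This works for any number of search terms, and only the row 'two, one' is selected here because it is the only string containing both 'two' and 'one'. If the terms are meant to be regular expressions, dropping regex=False would restore the default pattern matching of str.contains.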