
Using: Python 3.7.3, Pandas 0.24.2

I am in the process of writing some search functionality in Python using Pandas data frames.

I have a line of code that searches for results containing all keywords in a list:

processed = df.loc[(df.Keywords.str.contains("magnetic")) & (df.Keywords.str.contains("board")) & (df.Keywords.str.contains("white"))]

I need to make the search terms dynamic, i.e. generate an equivalent of this line from a variable containing any number of words.

I have managed to do this with regex, but it is significantly slower than the method above. I can pass in one search term simply enough, but not a variable number of terms.

I also have to consider that search terms may be partial, i.e. a search term of "agnet" should match rows containing "magnet", etc.
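Worth noting: `str.contains` already does substring (regex) matching by default, so partial terms match without extra work. A minimal sketch with made-up sample data:

```python
import pandas as pd

# Hypothetical sample data; str.contains matches substrings by default,
# so the partial term "agnet" already matches "magnet" and "magnetic".
df = pd.DataFrame({'Keywords': ['magnetic whiteboard', 'cork board', 'magnet set']})
hits = df[df['Keywords'].str.contains('agnet')]
print(hits['Keywords'].tolist())  # ['magnetic whiteboard', 'magnet set']
```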

Any options gratefully received.

To clarify:

I've already tried using options like :

processed = df[df['Keywords'].str.contains('|'.join(search_list))]

Unfortunately this returns any row that contains any of the search terms, i.e. magnetic OR board OR white. What I need is to return rows that contain magnetic AND board AND white. Imagine searching for products on Amazon; that would be the closest comparison.
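For completeness, AND semantics can also be expressed in a single regex using lookaheads, so one `str.contains` call covers any number of terms. This is a sketch of an alternative approach, not one of the answers below, and the sample frame is made up:

```python
import re
import pandas as pd

df = pd.DataFrame({'Keywords': ['magnetic whiteboard', 'white board', 'magnet']})
search_list = ['magnetic', 'board', 'white']

# Each (?=.*term) lookahead must succeed from the same position,
# so a row matches only if it contains every term.
pattern = ''.join(f'(?=.*{re.escape(t)})' for t in search_list)
processed = df[df['Keywords'].str.contains(pattern)]
print(processed['Keywords'].tolist())  # ['magnetic whiteboard']
```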

Results from suggestions below:

I have tested the options provided below with the following code:

search_terms = "magnetic Board White"
search_terms = search_terms.lower()
search_list = search_terms.split()

start_time = time.time()
processed = df.loc[(df.Keywords.str.contains("magnetic")) & (df.Keywords.str.contains("board")) & (df.Keywords.str.contains("white"))]
print("--- Original %s seconds ---" % (time.time() - start_time))

start_time = time.time()
mask = pd.concat([df['Keywords'].str.contains(x) for x in search_list], axis=1).all(axis=1)
processed = df[mask]
print("--- Concat %s seconds ---" % (time.time() - start_time))

start_time = time.time()
processed = df[np.logical_and.reduce([df['Keywords'].str.contains(x) for x in search_list])]
print("--- Numpy reduce %s seconds ---" % (time.time() - start_time))

On the data set I am using I had the following results:

--- Original 0.09292888641357422 seconds ---
--- Concat 0.09293532371520996 seconds ---
--- Numpy reduce 0.11991643905639648 seconds ---
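These one-shot `time.time()` measurements are close enough that run-to-run noise could swap the ordering; `timeit` repeats each call and reports the best of several runs for a steadier comparison. A sketch with a made-up sample frame:

```python
import timeit
import pandas as pd

df = pd.DataFrame({'Keywords': ['magnetic whiteboard'] * 1000})
search_list = ['magnetic', 'board', 'white']

def concat_all():
    # The concat/DataFrame.all variant under test
    mask = pd.concat([df['Keywords'].str.contains(x) for x in search_list],
                     axis=1).all(axis=1)
    return df[mask]

# Best of 3 repeats of 10 calls each; min is the least noisy statistic
print(min(timeit.repeat(concat_all, number=10, repeat=3)))
```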

As such I have chosen to use the concat/DataFrame.all method suggested by @jezrael.
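For reuse with any number of terms, the concat/DataFrame.all approach can be wrappeded in a small helper. The function name `search_all` and the empty-list guard are my additions, not from the answers:

```python
import pandas as pd

def search_all(frame, col, terms):
    """Return rows of `frame` where column `col` contains every term in `terms`."""
    if not terms:  # pd.concat raises on an empty list, so return everything
        return frame
    mask = pd.concat([frame[col].str.contains(t) for t in terms],
                     axis=1).all(axis=1)
    return frame[mask]

df = pd.DataFrame({'Keywords': ['magnetic whiteboard', 'white board']})
print(search_all(df, 'Keywords', ['white', 'board', 'magnet'])['Keywords'].tolist())
# ['magnetic whiteboard']
```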

Many thanks for the support all.

Matthew Baker
  • Build a function which takes a dataframe,column and a list as input. Then use the method from the linked answer to return you the rows you want. Something like: `def search_string(dataframe, col, words)`. Then as body `return dataframe[dataframe[col].str.contains('|'.join(words))]` – Erfan Jul 10 '19 at 08:57
  • I don't think this is a duplicate of the linked question. OP is searching for results which contain **all** search terms, the linked question searched for **any** search term. – sjw Jul 10 '19 at 09:05
  • You are right @thesilkworm, voted for reopen – Erfan Jul 10 '19 at 09:11
  • @Matthew, could you add some example data so we can reproduce an answer for you. – Erfan Jul 10 '19 at 09:17
  • Many thanks @Erfan. Is there any way I can get the duplicate flag removed? Or is it just a case of I have to get multiple people to vote re-open? – Matthew Baker Jul 10 '19 at 09:18
  • 1
    reopened question – jezrael Jul 10 '19 at 09:20

3 Answers


Use np.logical_and.reduce on the list of masks built in a list comprehension, then filter:

processed = df[np.logical_and.reduce([df['Keywords'].str.contains(x) for x in search_list])]

Another solution with concat and DataFrame.all:

mask = pd.concat([df['Keywords'].str.contains(x) for x in search_list], axis=1).all(axis=1)
processed = df[mask]
jezrael

Try this: put all the strings in a list and use str.contains with join:

list_str = ['board', 'white', 'magnetic']
df = df[df['Keywords'].str.contains('|'.join(list_str))]

If you want rows matching all terms:

results= df[np.logical_and.reduce([df['Keywords'].str.contains(x) for x in list_str])]
tawab_shakeel
  • It's worth mentioning that this essentially builds a regex, and takes advantage of `contains` knowing how to handle regexes. – DeepSpace Jul 10 '19 at 08:53
  • 1
    Thanks for the option. Please see my added clarification above - I'm looking for ALL terms not one in a list. – Matthew Baker Jul 10 '19 at 09:11

I would do this using numpy.logical_and.reduce:

df = pd.DataFrame({'Keywords': ['whiteboard', 'white', 'board', 'magnetic whiteboard', 'magnet']})
search = ['white', 'board', 'magnet']

df[np.logical_and.reduce([(df.Keywords.str.contains(s)) for s in search])]

Results:

              Keywords
3  magnetic whiteboard

I haven't tested the speed, but I think it'll perform well compared to regex solutions.
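One caveat (my addition, not from the answer): `str.contains` treats its pattern as a regex by default, so a term with metacharacters such as `c++` would raise an error. Escape terms with `re.escape`, or pass `regex=False` for plain substring matching. A sketch with made-up data:

```python
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({'Keywords': ['c++ magnetic board', 'cork board']})
search = ['c++', 'board']

# re.escape turns each term into a literal pattern; 'c++' alone is
# invalid regex (multiple repeat) and would raise an error.
mask = np.logical_and.reduce([df['Keywords'].str.contains(re.escape(s))
                              for s in search])
print(df[mask]['Keywords'].tolist())  # ['c++ magnetic board']
```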

sjw
  • Many thanks, I'll test for speed on my data set and let you know the outcome. – Matthew Baker Jul 10 '19 at 09:28
  • Great answer. I did find that the concat method listed by @jezrael was quicker however so I have accepted that as the answer but have upvoted this as working perfectly. Thanks for the help. – Matthew Baker Jul 10 '19 at 10:15