
Using: Python 3.7.3, Pandas 0.24.2

I am in the process of writing some search functionality in Python using Pandas data frames.

I have a line of code that searches for results containing all keywords in a list:

processed = df.loc[(df.Keywords.str.contains("magnetic")) & (df.Keywords.str.contains("board")) & (df.Keywords.str.contains("white"))]

I need to make the search terms dynamic, i.e. generate an equivalent of this line from a variable containing any number of words.

I have managed to do this with regex, but it is significantly slower than the method above. I can pass in one search term simply enough, but not a variable number of terms.

I also have to consider that search terms may be partial, i.e. a search term of "agnet" should match rows containing "magnet", etc.
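Worth noting: `str.contains` already does substring (regex) matching by default, so partial terms match without extra work. A minimal sketch with made-up sample data:

```python
import pandas as pd

# Hypothetical sample data; str.contains matches substrings by default,
# so the partial term "agnet" already matches "magnet" and "magnetic".
df = pd.DataFrame({'Keywords': ['magnetic whiteboard', 'cork board', 'magnet set']})
hits = df[df['Keywords'].str.contains('agnet')]
print(hits['Keywords'].tolist())  # ['magnetic whiteboard', 'magnet set']
```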

Any options gratefully received.

To clarify:

I've already tried using options like :

processed = df[df['Keywords'].str.contains('|'.join(search_list))]

Unfortunately this returns any row that contains any of the search terms, i.e. magnetic OR board OR white. What I need is to return rows that contain magnetic AND board AND white. Imagine searching for products on Amazon; that would be the closest comparison.
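For completeness, AND semantics can also be expressed in a single regex using lookaheads, so one `str.contains` call covers any number of terms. This is a sketch of an alternative approach, not one of the answers below, and the sample frame is made up:

```python
import re
import pandas as pd

df = pd.DataFrame({'Keywords': ['magnetic whiteboard', 'white board', 'magnet']})
search_list = ['magnetic', 'board', 'white']

# Each (?=.*term) lookahead must succeed from the same position,
# so a row matches only if it contains every term.
pattern = ''.join(f'(?=.*{re.escape(t)})' for t in search_list)
processed = df[df['Keywords'].str.contains(pattern)]
print(processed['Keywords'].tolist())  # ['magnetic whiteboard']
```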

Results from suggestions below:

I have tested the options provided below with the following code:

search_terms = "magnetic Board White"
search_terms = search_terms.lower()
search_list = search_terms.split()

start_time = time.time()
processed = df.loc[(df.Keywords.str.contains("magnetic")) & (df.Keywords.str.contains("board")) & (df.Keywords.str.contains("white"))]
print("--- Original %s seconds ---" % (time.time() - start_time))

start_time = time.time()
mask = pd.concat([df['Keywords'].str.contains(x) for x in search_list], axis=1).all(axis=1)
processed = df[mask]
print("--- Concat %s seconds ---" % (time.time() - start_time))

start_time = time.time()
processed = df[np.logical_and.reduce([df['Keywords'].str.contains(x) for x in search_list])]
print("--- Numpy reduce %s seconds ---" % (time.time() - start_time))

On the data set I am using I had the following results:

--- Original 0.09292888641357422 seconds ---
--- Concat 0.09293532371520996 seconds ---
--- Numpy reduce 0.11991643905639648 seconds ---
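These one-shot `time.time()` measurements are close enough that run-to-run noise could swap the ordering; `timeit` repeats each call and reports the best of several runs for a steadier comparison. A sketch with a made-up sample frame:

```python
import timeit
import pandas as pd

df = pd.DataFrame({'Keywords': ['magnetic whiteboard'] * 1000})
search_list = ['magnetic', 'board', 'white']

def concat_all():
    # The concat/DataFrame.all variant under test
    mask = pd.concat([df['Keywords'].str.contains(x) for x in search_list],
                     axis=1).all(axis=1)
    return df[mask]

# Best of 3 repeats of 10 calls each; min is the least noisy statistic
print(min(timeit.repeat(concat_all, number=10, repeat=3)))
```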

As such I have chosen to use the concat/DataFrame.all method suggested by @jezrael.
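For reuse with any number of terms, the concat/DataFrame.all approach can be wrappeded in a small helper. The function name `search_all` and the empty-list guard are my additions, not from the answers:

```python
import pandas as pd

def search_all(frame, col, terms):
    """Return rows of `frame` where column `col` contains every term in `terms`."""
    if not terms:  # pd.concat raises on an empty list, so return everything
        return frame
    mask = pd.concat([frame[col].str.contains(t) for t in terms],
                     axis=1).all(axis=1)
    return frame[mask]

df = pd.DataFrame({'Keywords': ['magnetic whiteboard', 'white board']})
print(search_all(df, 'Keywords', ['white', 'board', 'magnet'])['Keywords'].tolist())
# ['magnetic whiteboard']
```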

Many thanks for the support all.

Matthew Baker
  • Build a function which takes a dataframe,column and a list as input. Then use the method from the linked answer to return you the rows you want. Something like: `def search_string(dataframe, col, words)`. Then as body `return dataframe[dataframe[col].str.contains('|'.join(words))]` – Erfan Jul 10 '19 at 08:57
  • I don't think this is a duplicate of the linked question. OP is searching for results which contain **all** search terms, the linked question searched for **any** search term. – sjw Jul 10 '19 at 09:05
  • You are right @thesilkworm, voted for reopen – Erfan Jul 10 '19 at 09:11
  • @Matthew, could you add some example data so we can reproduce an answer for you. – Erfan Jul 10 '19 at 09:17
  • Many thanks @Erfan. Is there any way I can get the duplicate flag removed? Or is it just a case of I have to get multiple people to vote re-open? – Matthew Baker Jul 10 '19 at 09:18
  • 1
    reopened question – jezrael Jul 10 '19 at 09:20

3 Answers


Use np.logical_and.reduce on the list of masks built in a list comprehension, then filter:

processed = df[np.logical_and.reduce([df['Keywords'].str.contains(x) for x in search_list])]

Another solution with concat and DataFrame.all:

mask = pd.concat([df['Keywords'].str.contains(x) for x in search_list], axis=1).all(axis=1)
processed = df[mask]
jezrael

Try this: put all the strings in a list and use str.contains with join:

list_str = ['board', 'white', 'magnetic']
df = df[df['Keywords'].str.contains('|'.join(list_str))]

If you want rows matching all terms:

results= df[np.logical_and.reduce([df['Keywords'].str.contains(x) for x in list_str])]
tawab_shakeel
  • It's worth mentioning that this essentially builds a regex, and takes advantage of `contains` knowing how to handle regexes. – DeepSpace Jul 10 '19 at 08:53
  • 1
    Thanks for the option. Please see my added clarification above - I'm looking for ALL terms not one in a list. – Matthew Baker Jul 10 '19 at 09:11

I would do this using numpy.logical_and.reduce:

df = pd.DataFrame({'Keywords': ['whiteboard', 'white', 'board', 'magnetic whiteboard', 'magnet']})
search = ['white', 'board', 'magnet']

df[np.logical_and.reduce([(df.Keywords.str.contains(s)) for s in search])]

Results:

              Keywords
3  magnetic whiteboard

I haven't tested the speed, but I think it'll perform well compared to regex solutions.
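One caveat (my addition, not from the answer): `str.contains` treats its pattern as a regex by default, so a term with metacharacters such as `c++` would raise an error. Escape terms with `re.escape`, or pass `regex=False` for plain substring matching. A sketch with made-up data:

```python
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({'Keywords': ['c++ magnetic board', 'cork board']})
search = ['c++', 'board']

# re.escape turns each term into a literal pattern; 'c++' alone is
# invalid regex (multiple repeat) and would raise an error.
mask = np.logical_and.reduce([df['Keywords'].str.contains(re.escape(s))
                              for s in search])
print(df[mask]['Keywords'].tolist())  # ['c++ magnetic board']
```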

sjw
  • Many thanks, I'll test for speed on my data set and let you know the outcome. – Matthew Baker Jul 10 '19 at 09:28
  • Great answer. I did find that the concat method listed by @jezrael was quicker however so I have accepted that as the answer but have upvoted this as working perfectly. Thanks for the help. – Matthew Baker Jul 10 '19 at 10:15