Using: Python 3.7.3, Pandas 0.24.2
I am in the process of writing some search functionality in Python using Pandas data frames.
I have a line of code that searches for rows containing all keywords in a list:
processed = df.loc[(df.Keywords.str.contains("magnetic")) & (df.Keywords.str.contains("board")) & (df.Keywords.str.contains("white"))]
I need to make the search terms dynamic, i.e. generate an equivalent of this line from a variable containing any number of words.
I have managed to code this using regex; however, it is significantly slower than the method above. I can pass one search term in simply enough, but not a variable number of terms.
I also have to consider that search terms may be partial, i.e. a search term of "agnet" should match a row containing "magnet", etc.
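For what it's worth, str.contains already handles the partial-match requirement, since it does substring matching rather than whole-word matching. A minimal sketch (the sample frame here is hypothetical, standing in for the real data):

```python
import pandas as pd

# Hypothetical sample data in place of the real DataFrame.
df = pd.DataFrame(
    {"Keywords": ["magnetic white board", "cork board", "white shelf"]}
)

# str.contains matches anywhere in the string, so the partial term
# "agnet" still matches rows containing "magnetic".
mask = df["Keywords"].str.contains("agnet")
print(df[mask])
```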
Any options gratefully received.
To clarify:
I've already tried using options like:
processed = df[df['Keywords'].str.contains('|'.join(search_list))]
Unfortunately this returns any row that contains any of the search terms, i.e. Magnetic OR Board OR White. What I need is to return rows that contain Magnetic AND Board AND White. Imagine searching for products on Amazon; that would be the closest comparison.
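You mention the regex route was slower, but for reference the AND semantics can be expressed in a single pattern using positive lookaheads, which requires every term to appear somewhere in the string, in any order. A sketch with hypothetical sample data:

```python
import re
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"Keywords": ["magnetic white board", "white board", "magnetic strip"]}
)
search_list = ["magnetic", "board", "white"]

# One lookahead per term: "(?=.*magnetic)(?=.*board)(?=.*white)".
# Each lookahead must succeed, so only rows containing ALL terms match.
pattern = "".join(f"(?=.*{re.escape(term)})" for term in search_list)
processed = df[df["Keywords"].str.contains(pattern)]
print(processed)
```

Lookaheads are non-capturing, so str.contains raises no match-group warning here; the cost is that the regex engine rescans the string once per term, which is consistent with it benchmarking slower than the boolean-mask approaches.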
Results from the suggestions below:
I have tested the suggested options with the following code:
search_terms = "magnetic Board White"
search_terms = search_terms.lower()
search_list = search_terms.split()
start_time = time.time()
processed = df.loc[(df.Keywords.str.contains("magnetic")) & (df.Keywords.str.contains("board")) & (df.Keywords.str.contains("white"))]
print("--- Original %s seconds ---" % (time.time() - start_time))
start_time = time.time()
mask = pd.concat([df['Keywords'].str.contains(x) for x in search_list], axis=1).all(axis=1)
processed = df[mask]
print("--- Concat %s seconds ---" % (time.time() - start_time))
start_time = time.time()
processed = df[np.logical_and.reduce([df['Keywords'].str.contains(x) for x in search_list])]
print("--- Numpy reduce %s seconds ---" % (time.time() - start_time))
On the data set I am using I had the following results:
--- Original 0.09292888641357422 seconds ---
--- Concat 0.09293532371520996 seconds ---
--- Numpy reduce 0.11991643905639648 seconds ---
As such I have chosen to use the concat/DataFrame.all method suggested by @jezrael.
Many thanks for the support, all.
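Wrapped up, the chosen approach might look something like this (the helper name and sample data are hypothetical; `regex=False` is used since the terms are plain substrings):

```python
import pandas as pd

def search_all_terms(df, column, search_terms):
    """Return rows where `column` contains every whitespace-separated
    term in `search_terms` (case-insensitive, partial matches allowed)."""
    terms = search_terms.lower().split()
    # Build one boolean column per term, then keep rows where all are True.
    mask = pd.concat(
        [df[column].str.lower().str.contains(t, regex=False) for t in terms],
        axis=1,
    ).all(axis=1)
    return df[mask]

# Hypothetical sample data.
df = pd.DataFrame(
    {"Keywords": ["magnetic white board", "white board", "magnetic board white"]}
)
print(search_all_terms(df, "Keywords", "magnetic Board White"))
```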