
I have a huge DataFrame with 3M records and a column called description. I also have a set of around 5k possible substrings.

I want to get the rows in which the description contains any of the substrings.

I used the following loop:

import re

for ss in searchstring:
    for k in range(len(df)):
        desc = df['description'].iloc[k].lower()
        if re.search(ss, desc):
            trans.append(df.iloc[k])

The issue is that it takes too much time, since the search loops 5k × 3M times.

Is there any better way to search for substrings?
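One common vectorized alternative (not from the question itself) is to escape all the substrings and join them into a single alternation regex, so pandas scans each description once instead of 5k times. A minimal sketch with made-up data; `searchstring` stands in for the real 5k-entry list:

```python
import re
import pandas as pd

# Toy stand-ins for the real 3M-row frame and 5k substrings.
df = pd.DataFrame({'description': ['Red apple pie', 'Banana split', 'Cherry cake']})
searchstring = ['apple', 'cherry']

# Escape each substring and join into one alternation pattern,
# so each row is scanned only once.
pattern = '|'.join(re.escape(ss.lower()) for ss in searchstring)

matches = df[df['description'].str.lower().str.contains(pattern, regex=True)]
```

With 5k alternatives the compiled pattern is large, but a single pass over the column is still typically far cheaper than 5k separate passes.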

Manu Mohan
  • I would suggest using a HashMap for your larger set of records; this will be expensive in memory, but then you can check whether a record exists within this set in O(1) time for each record. – JahKnows Apr 13 '18 at 04:37
  • You use `re.search` ... are these substrings regular expressions? – tdelaney Apr 13 '18 at 05:06
  • these are substrings only: `ss = re.compile(re.escape(searchstring[i].lower()))` – Manu Mohan Apr 13 '18 at 05:23
  • Assuming that you have a 5k list of search strings (none of this re.compile business), I think you can select them in the dataframe with `df['description'].str.lower().isin(search_strings_list)` and you can use that selection to remove non-conforming rows from the dataframe. So, `df[df['description'].str.lower().isin(search_strings_list)]`. It's easy to try... let me know if it works! – tdelaney Apr 13 '18 at 05:42
  • It might even be possible to do `isin(set(search_strings_list))` for a bit of a boost. – tdelaney Apr 13 '18 at 05:43
  • `df[df.isin({'description':sub_string_set}).any(1)]` ... [Indexing with isin](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-with-isin). - `sub_string_set` can be a set or a list. – wwii Apr 13 '18 at 05:57
  • @tdelaney didn't work. I got an empty result, as I don't have any exact matches in the description column. I guess `isin` only returns rows where the entire value matches, not a substring. I'm not an expert Python user. – Manu Mohan Apr 13 '18 at 06:12
  • @wwii I'm not sure about indexing with isin. I tried the code you gave but got the following error: `SystemError: returned a result with an error set` – Manu Mohan Apr 13 '18 at 06:13
  • Possible duplicate of [Select rows from a DataFrame based on values in a column in pandas](https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas). The accepted answer has an `isin()` solution/example which is relevant to this question. – wwii Apr 13 '18 at 14:31
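A small sketch (with made-up data) of why the `isin` suggestions in the comments above return nothing here: `isin` tests whole-cell equality, not substring containment.

```python
import pandas as pd

df = pd.DataFrame({'description': ['hello world', 'hello', 'goodbye']})
search_strings_list = ['hello']

# isin() compares entire cell values:
# 'hello' matches, but 'hello world' does not.
exact = df[df['description'].str.lower().isin(search_strings_list)]
```

This is why the accepted pattern works for exact-value lookups but not for the substring search asked about in the question.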

2 Answers


It should be faster if you use the pandas isin() function.

Example:

import pandas as pd

a = 'Hello world'
ss = a.split(" ")

df = pd.DataFrame({'col1': ['Hello', 'asd', 'asdasd', 'world']})
df.loc[df['col1'].isin(ss)].index

Returns a list of indexes:

Int64Index([0, 3], dtype='int64')
shep4rd
    I guess isin won't search for a substring in a particular column. – Manu Mohan Apr 13 '18 at 05:14
  • I think it does search in a particular column if you call isin() on that column. I did some simple checks while writing this answer and searching on one column was possible. – shep4rd Apr 13 '18 at 05:46
  • [Indexing with isin](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-with-isin) – wwii Apr 13 '18 at 05:55

I found an alternative method. I created a word dictionary (an inverted index) for the description column of the 3M data set by splitting each description into words. (I replaced numbers in the descriptions with zeros and used the result for dictionary generation.)

import re

def tokenize(desc):
    desc = re.sub(r'\d', '0', desc)
    tokens = re.split(r'\s+', desc)
    return tokens

def make_inv_index(df):
    inv_index = {}
    for i, tokens in df['description_removed_numbers'].iteritems():
        for token in tokens:
            try:
                inv_index[token].append(i)
            except KeyError:
                inv_index[token] = [i]
    return inv_index

df['description_removed_numbers'] = df['description'].apply(tokenize)
inv_index_df = make_inv_index(df)

Now, when searching the descriptions, I apply the same tokenization to the search string, take the intersection of the index lists of its words using the dictionary, and search only those rows. This substantially reduced the overall running time of the program.
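The lookup step described above might look like the following sketch. The function name `candidate_rows` and the toy `inv_index` are mine, not from the answer; the tokenization mirrors the indexing code:

```python
import re

def tokenize(desc):
    # Same normalization as the indexing step: digits -> 0, split on whitespace.
    desc = re.sub(r'\d', '0', desc)
    return re.split(r'\s+', desc)

def candidate_rows(search_string, inv_index):
    # Intersect the posting lists of every token in the search string;
    # only these rows can possibly contain all of its words.
    token_sets = [set(inv_index.get(tok, [])) for tok in tokenize(search_string.lower())]
    return set.intersection(*token_sets) if token_sets else set()

# Toy index: token -> list of row positions where it appears.
inv_index = {'red': [0, 2], 'apple': [0, 1], 'pie': [0]}
rows = candidate_rows('red apple', inv_index)
```

A full regex or substring check then only needs to run over the rows in `rows`, which is what cuts the 5k × 3M scan down so sharply.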

Manu Mohan