I would like to search for keywords in a DataFrame column called 'string'.

The keywords are stored in a dictionary.

For each key, the value is a list of several keywords.

My problem is that the code below is very slow and takes a long time to run.

Perhaps too many loops are involved, and I don't think df.str.contains can be used here.

How can I speed up the process?

import re

import pandas as pd

LETTER = re.compile('[a-zA-Z]')

def match(string, keyword):
    # Return 1 if the first occurrence of keyword in string is a whole word, else 0.
    m = len(string)
    n = len(keyword)
    idx = string.find(keyword)
    if idx == -1:
        return 0
    # If the keyword starts with a letter, the preceding character must not be a letter.
    if LETTER.match(string[idx]):
        if idx > 0 and LETTER.match(string[idx - 1]):
            return 0
    # If the keyword ends with a letter, the following character must not be a letter.
    if LETTER.match(string[idx + n - 1]):
        if idx + n < m and LETTER.match(string[idx + n]):
            return 0
    return 1

def match_keyword(df, keyword_dict, name):
    # Build one 0/1 column per dictionary key, named name + '_' + key.
    df_new = pd.DataFrame()
    for owner_id, keyword in keyword_dict.items():
        col = name + '_' + str(owner_id)
        try:
            for index, data in df.iterrows():
                # The row is flagged if at least one keyword of this key matches.
                hits = [match(data['string'], word) for word in keyword]
                df_new.loc[index, col] = int(sum(hits) > 0)
        except Exception:
            # On any error, fall back to an all-zero column for this key.
            df_new[col] = 0
    return df_new.astype(int)

Input:

  String
0 New Beauty Company is now offering 超級discounts
1 Swimming is good for children and adults
2 Children love food though it may not be good

keywords:{'a':['New', 'is', '超級'], 'b':['Swim', 'discounts', 'good']}

Results:

     'New'  'is'  '超級'   result(or relation)
0     1       1      1        1
1     0       1      0        1
2     0       0      0        0

     'Swim'  'discounts'  'good' result(or relation)
0     0          1          0        1
1     0          0          1        1
2     0          0          1        1

Final results:

    'a'    'b'
0    1      1
1    1      1
2    0      1
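
For reproducibility, the input and the expected final result above could be written out as below. This is only a sketch: the variable names `df`, `keywords` and `expected` are assumptions, and the column is named `String` as in the printed input (the functions above read `'string'`).

```python
import pandas as pd

# Sample input from the question; the column name follows the printed table.
df = pd.DataFrame({
    "String": [
        "New Beauty Company is now offering 超級discounts",
        "Swimming is good for children and adults",
        "Children love food though it may not be good",
    ]
})

# Each key maps to a list of keywords; a key is flagged if any keyword matches.
keywords = {"a": ["New", "is", "超級"], "b": ["Swim", "discounts", "good"]}

# Expected final result, copied from the table above.
expected = pd.DataFrame({"a": [1, 1, 0], "b": [1, 1, 1]})
```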
Chan
  • Can you provide an example of your inputs and expected outputs? – asongtoruin Jun 12 '18 at 07:21
  • The Pandas algorithm is basic. You can use regex, but for a large speed-up you probably need a specialist library, e.g. see [this answer](https://stackoverflow.com/a/48600345/9209546). – jpp Jun 12 '18 at 08:17
  • Thank you, jpp. It is interesting. – Chan Jun 12 '18 at 08:33

1 Answer

I believe you need str.contains in a loop over the dict, with word boundaries (\b) around each keyword and the keywords joined by | for regex OR:

for k, v in keywords.items():
    pat = '|'.join(r"\b{}\b".format(x) for x in v)
    #print (pat)
    df[k] = df['String'].str.contains(pat).astype(int)

print (df)
                                         String  a  b
0  New Beauty Company is now offering discounts  1  1
1      Swimming is good for children and adults  1  1
2  Children love food though it may not be good  0  1
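
If some keywords begin or end with non-Latin characters, as 超級 does in the question's input, a possible variant is to replace `\b` with lookarounds that are added only on the ends where the keyword has an ASCII letter, mirroring the rule in the question's `match` function. The sketch below is an assumption about what is wanted, not a tested drop-in replacement; `build_pattern` is a hypothetical helper name. `re.escape` is added so that keywords containing regex metacharacters are taken literally.

```python
import re

import pandas as pd

def build_pattern(words):
    """Join keywords into one regex, guarding only the ends that are ASCII letters."""
    parts = []
    for w in words:
        # No ASCII letter may precede a keyword that starts with one.
        left = r"(?<![a-zA-Z])" if re.match(r"[a-zA-Z]", w[0]) else ""
        # No ASCII letter may follow a keyword that ends with one.
        right = r"(?![a-zA-Z])" if re.match(r"[a-zA-Z]", w[-1]) else ""
        parts.append(left + re.escape(w) + right)
    return "|".join(parts)

df = pd.DataFrame({
    "String": [
        "New Beauty Company is now offering 超級discounts",
        "Swimming is good for children and adults",
        "Children love food though it may not be good",
    ]
})
keywords = {"a": ["New", "is", "超級"], "b": ["Swim", "discounts", "good"]}

for k, v in keywords.items():
    df[k] = df["String"].str.contains(build_pattern(v), regex=True).astype(int)

print(df[["a", "b"]])
```

On this sample the columns come out as a = 1, 1, 0 and b = 1, 1, 1. With plain `\b`, "discounts" in "超級discounts" would not be found, because Python's `re` treats 級 as a word character for `str` patterns. Unlike `match`, which inspects only the first occurrence of a keyword, the regex considers every occurrence.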

If you also need a column for each keyword, create a MultiIndex in the columns:

df = df.set_index('String')
for k, v in keywords.items():
    for x in v:
        df[(k, x)] = df.index.str.contains(x).astype(int)

df.columns = pd.MultiIndex.from_tuples(df.columns)
print (df)
                                               a       b               
                                             New is Swim discounts good
String                                                                 
New Beauty Company is now offering discounts   1  1    0         1    0
Swimming is good for children and adults       0  1    1         0    1
Children love food though it may not be good   0  0    0         0    1

Then it is possible to get the max per first level of the MultiIndex:

# df.max(axis=1, level=0) was removed in pandas 2.0; group the transposed frame instead
df = df.T.groupby(level=0).max().T
print (df)
                                              a  b
String                                            
New Beauty Company is now offering discounts  1  1
Swimming is good for children and adults      1  1
Children love food though it may not be good  0  1
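
The same per-keyword table could also be built without moving String into the index. The sketch below is one way to do it, under the assumption that a plain substring match is wanted: it passes `regex=False` so each keyword is taken literally, and the names `per_word` and `per_key` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "String": [
        "New Beauty Company is now offering 超級discounts",
        "Swimming is good for children and adults",
        "Children love food though it may not be good",
    ]
})
keywords = {"a": ["New", "is", "超級"], "b": ["Swim", "discounts", "good"]}

# One 0/1 column per (key, keyword) pair; tuple keys become a column MultiIndex.
per_word = pd.DataFrame({
    (k, x): df["String"].str.contains(x, regex=False).astype(int)
    for k, v in keywords.items()
    for x in v
})

# Collapse to one column per key: 1 if any keyword of that key was found.
per_key = per_word.T.groupby(level=0).max().T

print(per_key)
```

Because this is a literal substring test, "Swim" is found inside "Swimming", as in the MultiIndex output above.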
jezrael
  • Thank you, jezrael. How about if there are Chinese characters? – Chan Jun 12 '18 at 08:32
  • @Chan - Hard question, really I have no idea. But second solution should be better then. – jezrael Jun 12 '18 at 08:33
  • I am not familiar with regex. I searched the web and found that there is `(?(?=regex)then|else)`. How to write the regex: if the first character is [a-zA-Z], then put the first `\b` else put nothing. Moreover, if the last character is [a-zA-Z], then put the last `\b` else put nothing? – Chan Jun 12 '18 at 09:27
  • @Chan - Unfortunately I have same problem with complicated regex, maybe the best create new question for this :( Sorry. – jezrael Jun 12 '18 at 09:28