1

I have a search list in a column which may contain a key: 'keyword1*keyword2' to try to find the match in a separate dataframe column. How can I include the regex wildcard type 'keyword1.*keyword2' #using str.extract, extractall or findall?

Using .str.extract works great matching exact substrings but I need it to also match substrings with wildcards in between the keyword.

# dataframe column or series list as keys to search for: 
dfKeys = pd.DataFrame()
dfKeys['SearchFor'] = ['this', 'Something', 'Second', 'Keyword1.*Keyword2', 'Stuff', 'One' ]

# col_next_to_SearchFor_col
dfKeys['AdjacentCol'] = ['this other string', 'SomeString Else', 'Second String Player', 'Keyword1 Keyword2', 'More String Stuff', 'One More String Example' ]

# dataframe column to search in: 
df1['Description'] = ['Something Here','Second Item 7', 'Something There', 'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END', 'Second Item 7', 'Even More Stuff']]

# I've tried:
df1['Matched'] = df1['Description'].str.extract('(%s)' % '|'.join(key['searchFor']), flags=re.IGNORECASE, expand=False)

I've also tried substituting 'extract' from the code above with both 'extractall' and 'findall' but it still does not give me the results I need. I expected 'Keyword1*Keyword2' to match "strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END"

UPDATE: The '.*' worked! I'm also trying to add the value from the cell next to the matched key in 'SearchFor' column i.e. dfKeys['AdjacentCol'].

I've tried: df1['From_AdjacentCol'] = df1['Description'].str.extract('(%s)' % '|'.join(key['searchFor']), flags=re.IGNORECASE, expand=False).map(dfKeys.set_index('SearchFor')['AdjacentCol'].to_dict()).fillna('') which works for everything but the keys with the wildcards.

# expected:
  Description                                      Matched            From_AdjacentCol
0 'Something Here'                                 'Something'         'this other string'
1 'Second Item 7'                                  'Second'            'Second String Player'
2 'Something There'                                'Something'         'this other string'  
3 'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2...' 'Keyword1*Keyword2' 'Keyword1 Keyword2'
4 'Second Item 7'                                  'Second'            'Second String Player'
5 'Even More Stuff'                                'Stuff'             'More String Stuff'

Any help with this is much appreciated. thanks!

lodestar08
  • 21
  • 1
  • 5
  • @Wiktor Stribiżew I haven't come across some similar questions but none of them address the wildcard aspect that I was having trouble with. – lodestar08 Apr 14 '19 at 14:37
  • You are using `.str.extract` that does not use wildcards, it uses regex. Matching any chars between two strings is a too frequent regex issue, hence the duplicate reason. – Wiktor Stribiżew Apr 14 '19 at 14:46

1 Answers1

1

Solution

You are close to the solution, just change * to .*. Reading the docs:

. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

In Regular Expression star symbol * alone means nothing. It has a different meaning than the usual glob operator * in Unix/Windows file systems.

Star symbol is a quantifier (namely the gready quantifier), it must be associated to some pattern (here . to match any character) to mean something.

MCVE

Reshaping your MCVE:

import re
import pandas as pd

keys = ['this', 'Something', 'Second', 'Keyword1.*Keyword2', 'Stuff', 'One' ]

df1 = pd.DataFrame()
df1['Description'] = ['Something Here','Second Item 7', 'Something There',
                      'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END',
                      'Second Item 7', 'Even More Stuff']


regstr = '(%s)' % '|'.join(keys)

df1['Matched'] = df1['Description'].str.extract(regstr, flags=re.IGNORECASE, expand=False)

The regexp is now:

(this|Something|Second|Keyword1.*Keyword2|Stuff|One)

And matches the missing case:

                                         Description                                Matched
0                                     Something Here                              Something
1                                      Second Item 7                                 Second
2                                    Something There                              Something
3  strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 an...  KEYWORD1 moreJARGON 06/0 010 KEYWORD2
4                                      Second Item 7                                 Second
5                                    Even More Stuff                                  Stuff
jlandercy
  • 7,183
  • 1
  • 39
  • 57
  • Thanks! that's what it was. Any chance you know why `df1['col_Next_toMatched'] = df1['Description'].str.extract(regstr, flags=re.IGNORECASE, expand=False).map(dfKeys.set_index('keys')['col_Next_toKeys'].to_dict()).fillna('')` works for everything but the keys with wildcards? – lodestar08 Apr 14 '19 at 14:33
  • @lodestar08 Could you update your post to show what is `dfKeys`? – jlandercy Apr 14 '19 at 15:14
  • I've updated the post - but not sure if that made it any clearer – lodestar08 Apr 14 '19 at 20:25