
I have a dataframe column containing variable, comma-separated text, and I am trying to extract the values that also appear in another list. My dataframe looks like this:

col1 | col2
-----------
 x   | a,b


import re

listformatch = ['c', 'd', 'f', 'b']
pattern = '|'.join(listformatch)

def test_for_pattern(x):
    if re.search(pattern, x):
        return pattern
    else:
        return x

# col2.str.contains(pattern) can also be used, with the same results

The filtering above works, but when it finds a match it returns the whole pattern (such as a|b) instead of just the matched value b. I want to create another column containing the value that was found, such as b.
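A minimal sketch of returning the matched text rather than the pattern, using only the standard `re` module. The name `find_match` is hypothetical, and the use of `re.escape` is an added assumption to guard against list values that contain regex metacharacters. Unlike the function above, it returns None when nothing matches.

```python
import re

# Values to look for; re.escape keeps any regex metacharacters literal.
listformatch = ['c', 'd', 'f', 'b']
pattern = '|'.join(re.escape(v) for v in listformatch)


def find_match(text):
    """Return the first listed value found in text, or None (hypothetical helper)."""
    match = re.search(pattern, text)
    # match.group() is the text that matched, not the pattern itself.
    return match.group() if match else None


print(find_match('a,b'))  # b
print(find_match('a'))    # None
```

The match object is kept in a variable so the search runs only once per value.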

Here is my final function, but I am still getting a warning that I would like to get rid of: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.

def matching_func(fin, fin1):
    # fin: path to the CSV with the values to look for; fin1: path to the Excel file to search
    file1 = pd.read_csv(fin)
    file2 = pd.read_excel(fin1, 0, skiprows=1)
    pattern = '|'.join(file1[col1].tolist())
    file2['new_col'] = file2[col1].map(lambda x: re.search(pattern, x).group()
                                       if re.search(pattern, x) else None)

I think I understand how pandas extract works now, but I am probably still rusty on regex. How do I create a pattern variable to use in the example below?

df[col1].str.extract('(word1|word2)')

Instead of writing the words in the argument, I want to build a variable such as pattern = 'word1|word2', but that won't work as is because the resulting string lacks the parentheses around the alternatives.

My final and preferred version with vectorized string method in pandas 0.13:

Using values from one column to extract from a second column:

df[col1].str.extract('({})'.format('|'.join(df[col2])))
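A runnable sketch of the same idea on made-up data, in case it helps. The column contents and the `new_col` name are illustrative only. It assumes a recent pandas, where `str.extract` accepts `expand=False` to return a Series for a single group, and it adds `re.escape` as a precaution for values containing regex metacharacters.

```python
import re

import pandas as pd

# Illustrative data: col1 holds the text to search, col2 the values to look for.
df = pd.DataFrame({
    'col1': ['x,a,b', 'word2 here', 'x,y'],
    'col2': ['b', 'word2', 'c'],
})

# One capture group wrapping all alternatives: '(b|word2|c)'.
pattern = '({})'.format('|'.join(re.escape(w) for w in df['col2']))

# expand=False returns a Series; rows without a match become NaN.
df['new_col'] = df['col1'].str.extract(pattern, expand=False)
print(df)
```

Note that `str.extract` keeps only the first match in each row. If several values can occur in one cell, `str.findall` with the same pattern may be closer to what is wanted.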
horatio1701d

1 Answer


You might like to use extract, or one of the other vectorised string methods:

In [11]: s = pd.Series(['a', 'a,b'])

In [12]: s.str.extract('([cdfb])')
Out[12]:
0    NaN
1      b
dtype: object
Andy Hayden
  • extract seems great. How would I use it, though, if I am getting the string matches from another dataframe column? In other words, for my function above I did `'|'.join(df[col1].tolist())` to get my pattern. – horatio1701d Mar 28 '14 at 11:14
  • Any idea how I can get rid of this message from my program: `UserWarning: This pattern has match groups. To actually get the groups, use str.extract.` – horatio1701d Mar 28 '14 at 15:18
  • @prometheus2305 yup, put parentheses around what you're trying to find (as in my example) :) – Andy Hayden Mar 28 '14 at 16:28
  • @prometheus2305 a DataFrame column is just a Series, so you can do `df[col1].str.extract('([cdfb])')`. – Andy Hayden Mar 28 '14 at 16:30
  • Thanks for the help. If I wanted to extract phrases that appear in a column based on passing an argument to extract with all of the potential phrases to find how would I construct the variable? I've added the example to the question. – horatio1701d Mar 29 '14 at 12:45
  • @prometheus2305 I think you're looking for `'(%s)' % '|'.join(patterns)` where `patterns = ['word1', 'word2']` ? – Andy Hayden Mar 29 '14 at 18:15
  • Wow. this code seems so strange but perfectly worked for what I needed. Thank you! `pattern = '({})'.format('|'.join(df[col1].tolist()))` – horatio1701d Mar 29 '14 at 20:57
  • @prometheus2305 note that the tolist call is not needed :) – Andy Hayden Mar 29 '14 at 21:21
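On the UserWarning discussed in the comments: a small sketch, assuming the warning comes from a `str.contains` call whose pattern has capturing parentheses. pandas issues it in that case, and a non-capturing group `(?:...)` avoids it, while `str.extract` still needs the capturing form. The `words` list and the helper `group_warnings` are hypothetical names used only for this illustration.

```python
import warnings

import pandas as pd

s = pd.Series(['a', 'a,b'])
words = ['c', 'd', 'f', 'b']  # hypothetical list of values to look for

capturing = '({})'.format('|'.join(words))        # '(c|d|f|b)'
non_capturing = '(?:{})'.format('|'.join(words))  # '(?:c|d|f|b)'


def group_warnings(pattern):
    """Run str.contains and return any 'match groups' warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        s.str.contains(pattern)
    return [w for w in caught if 'match groups' in str(w.message)]


print(len(group_warnings(capturing)))      # warning(s) for the capturing pattern
print(len(group_warnings(non_capturing)))  # none for the non-capturing pattern

# For pulling the value out, the capturing pattern is the one to pass to extract.
extracted = s.str.extract(capturing, expand=False)
print(extracted)
```

So filtering can use the non-capturing pattern and extraction the capturing one, both built from the same list.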