I have a DataFrame from which I would like to filter out "bad data" with a regex. In my use case, any number in column B that contains four or more identical digits in a row is considered "bad".

Here is my code:

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A' : [np.nan,np.nan,3,4,5,5,3,1,5,np.nan],
                    'B' : [1111111,1234567,2222,55555,0,0,np.nan,9,0,0],
                    'E' : ['Assign','Unassign','Assign','Ugly','Appreciate','Undo','Assign','Unicycle','Assign','Unicorn']})
print(df1)

bad_data = df1[df1['B'].astype(str).str.contains(r'(\d)\1{3,}')]
print(bad_data)

     A          B       E
0  NaN  1111111.0  Assign
2  3.0     2222.0  Assign
3  4.0    55555.0    Ugly

My code works, but I get this warning: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.

This was talked about here. Following that example, I changed my regex to use a non-capturing group (?:...):

bad_data = df1[df1['B'].astype(str).str.contains(r'(?:(\d))\1{3,}')] 

But I still receive the UserWarning, no matter where I put the non-capturing groups or how many I try. I could filter out the warning as in the other link, but is there something I am doing wrong, or could be doing better, that would keep the warning from popping up?
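To illustrate why wrapping the group does not help, here is a minimal sketch using only the standard `re` module. It assumes (this is not confirmed above) that pandas decides whether to warn by counting the capture groups in the compiled pattern. The variable names are just for illustration.

```python
import re

# Original pattern: one capture group, referenced by \1.
original = re.compile(r'(\d)\1{3,}')

# "Non-capturing" attempt: the outer (?:...) adds no group,
# but the inner (\d) is still a capture group.
wrapped = re.compile(r'(?:(\d))\1{3,}')

print(original.groups)
print(wrapped.groups)

# Removing the capture group entirely makes \1 invalid,
# so the pattern no longer compiles.
try:
    re.compile(r'(?:\d)\1{3,}')
    compiled_without_group = True
except re.error:
    compiled_without_group = False

print(compiled_without_group)
```

Both patterns report one capture group, so any check based on the group count would treat them the same way. The backreference `\1` needs a capture group to refer to.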

MattR
  • Yes, if you have a capturing group that you need to use a backreference to inside the pattern, you can't help it. – Wiktor Stribiżew Apr 07 '17 at 16:47
  • with pandas==0.19.2 I don't see any issue with second line of code that you use - regex to use a noncapturing group . – Shijo Apr 07 '17 at 17:24
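
One possible way around the backreference is to spell out the ten digit cases explicitly, so that the pattern contains no groups at all. This is a sketch using only the standard `re` module. The names `no_group_pattern` and `sample_values` are hypothetical, and the sample strings assume that column B is a float column, so `astype(str)` yields values such as '2222.0'.

```python
import re

# Build '0{4,}|1{4,}|...|9{4,}': four or more of the same digit,
# written without any capture group or backreference.
no_group_pattern = '|'.join(f'{d}{{4,}}' for d in '0123456789')
no_group_regex = re.compile(no_group_pattern)

# The original backreference pattern, for comparison.
backref_regex = re.compile(r'(\d)\1{3,}')

# Assumed string form of column B after astype(str).
sample_values = ['1111111.0', '1234567.0', '2222.0', '55555.0', '0.0',
                 '0.0', 'nan', '9.0', '0.0', '0.0']

mask_no_group = [bool(no_group_regex.search(v)) for v in sample_values]
mask_backref = [bool(backref_regex.search(v)) for v in sample_values]

print(no_group_regex.groups)
print(mask_no_group == mask_backref)
```

Because the alternation has no capture groups, passing `no_group_pattern` to `str.contains` should not trigger the warning, assuming the warning depends only on the group count. Alternatively, the compiled backreference pattern can be applied outside `str.contains`, for example with `df1['B'].astype(str).map(lambda s: bool(backref_regex.search(s)))`, which never hands the regex to pandas.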
