I have a DataFrame from which I would like to filter out "bad data" using a regex. In my use case, any number in column B
that contains 4 identical digits in a row is considered "bad".
Here is my code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 5, np.nan],
                    'B': [1111111, 1234567, 2222, 55555, 0, 0, np.nan, 9, 0, 0],
                    'E': ['Assign', 'Unassign', 'Assign', 'Ugly', 'Appreciate', 'Undo', 'Assign', 'Unicycle', 'Assign', 'Unicorn']})
print(df1)
bad_data = df1[df1['B'].astype(str).str.contains(r'(\d)\1{3,}')]
print(bad_data)
     A          B       E
0  NaN  1111111.0  Assign
2  3.0     2222.0  Assign
3  4.0    55555.0    Ugly
My code works, but I get this warning: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
This was talked about here. Following that example, I changed my regex to use a non-capturing group (?:...):
bad_data = df1[df1['B'].astype(str).str.contains(r'(?:(\d))\1{3,}')]
But I still receive the UserWarning, no matter where I put the non-capturing groups or how many I try. I could filter out the warning as in the other link, but is there something I am doing wrong, or could be doing better, that would keep the warning from popping up?
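For context, here is a minimal sketch of what seems to be going on, plus one possible way around it. It assumes the warning is raised whenever the compiled pattern contains at least one capturing group. A backreference such as \1 needs a capturing group to refer to, so wrapping that group in (?:...) leaves the group count unchanged. The names pattern, wrapped and mask are only illustrative, and the workaround swaps str.contains for Python's re module applied through Series.map; it is a different technique, not a change to how str.contains behaves.

```python
import re

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 5, np.nan],
    'B': [1111111, 1234567, 2222, 55555, 0, 0, np.nan, 9, 0, 0],
    'E': ['Assign', 'Unassign', 'Assign', 'Ugly', 'Appreciate',
          'Undo', 'Assign', 'Unicycle', 'Assign', 'Unicorn'],
})

# Both patterns still contain exactly one capturing group, because the
# backreference \1 has to point at a capturing group.
pattern = re.compile(r'(\d)\1{3,}')
wrapped = re.compile(r'(?:(\d))\1{3,}')
print(pattern.groups, wrapped.groups)  # 1 1

# Assumption: building the boolean mask with re.search directly bypasses
# str.contains, and with it the match-group check that triggers the warning.
mask = df1['B'].astype(str).map(lambda s: pattern.search(s) is not None)
bad_data = df1[mask]
print(bad_data)
```

This selects the same rows as the str.contains version (index 0, 2 and 3). The trade-off is a Python-level call per row, which should be fine for small frames but may be slower on large ones.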