re.match() in cleaning pandas data frame

Question

I want to use re.match() to clean a pandas data frame so that any entry equal to 1 or 2, in any column, remains unchanged, while every other value is set to NaN.

The problem is that my function sets everything to NaN. I'm new to regular expressions, so I think I've made a mistake.

Thanks!

# DATA
import re
import numpy as np
import pandas as pd
data = [['Bob', 10, 1], ['Bob', 2, 2], ['Clarke', 13, 1]]
my_df = pd.DataFrame(data, columns=['Name', 'Age', 'Sex'])

print(my_df)
     Name  Age  Sex
0     Bob   10    1
1     Bob    2    2
2  Clarke   13    1


# CLEANING FUNCTION
def my_fun(df):
    for col in df.columns:
        for row in df.index:
            if re.match(r'^\d{1}(\.)\d{2}$', str(df[col][row])):
                df[col][row] = df[col][row]
            else:
                df[col][row] = np.nan
    return df


# OUTPUT
my_fun(my_df)

  Name  Age  Sex
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN


# EXPECTED/DESIRED OUTPUT 

   Name  Age  Sex
0   NaN  NaN    1
1   NaN    2    2
2   NaN  NaN    1

Why do you need regex instead of using `my_df[my_df.isin([1,2])]`? — Chris, Aug 30 '20 at 12:13
Not directly answering your question, but I believe this will be easier, if replacement is your main aim. `my_df.replace([1, 2], np.nan)` This will return a replaced dataframe. The `replace` method also has an `inplace` parameter. P.S.: I notice that this seems to convert otherwise ints to floats, so watch out for that. — navneethc, Aug 30 '20 at 12:19

yatu · Accepted Answer · 2020-08-30T12:25:29.380

You can use `where` together with `isin` here for a full match:

my_df.where(my_df.isin([1,2]))

  Name  Age  Sex
0  NaN  NaN    1
1  NaN  2.0    2
2  NaN  NaN    1
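
As a self-contained sketch of the one-liner above, with the frame rebuilt from the question's data (the names `mask` and `cleaned` are only illustrative):

```python
import pandas as pd

# Rebuild the frame from the question.
data = [['Bob', 10, 1], ['Bob', 2, 2], ['Clarke', 13, 1]]
my_df = pd.DataFrame(data, columns=['Name', 'Age', 'Sex'])

# True where a cell equals 1 or 2, False everywhere else.
mask = my_df.isin([1, 2])

# where() keeps the cells whose mask is True and sets the rest to NaN.
# It returns a new frame; my_df itself is left untouched.
cleaned = my_df.where(mask)
print(cleaned)
```

The Age column comes back as float because NaN cannot be stored in a plain integer column, which is why the 2 is displayed as 2.0.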

Some observations:

`df[col][row]` is not a recommended way to index a DataFrame in pandas, because chained indexing like this can end up assigning to a temporary copy. Use `.loc` or `.iloc` instead; see Indexing and selecting data.
Looping over a DataFrame is also generally not recommended, since it tends to perform very poorly. I'd suggest reading How to iterate over rows in a DataFrame in Pandas.
You don't need a regex for what you want to do. You only want to match either 1 or 2, and there are more straightforward ways of doing that, both with Python lists and with pandas. Consider a regex once matching with built-in methods gets complicated.
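
If you do want to stay with a regex, here is a minimal sketch. The pattern in the question, `^\d{1}(\.)\d{2}$`, only matches one digit, a literal dot and then two digits (for example `1.23`), so no cell matches and everything becomes NaN. The helper `is_one_or_two` and the names `mask` and `cleaned` are my own, not part of pandas.

```python
import re
import pandas as pd

data = [['Bob', 10, 1], ['Bob', 2, 2], ['Clarke', 13, 1]]
my_df = pd.DataFrame(data, columns=['Name', 'Age', 'Sex'])

# Pattern from the question: digit, dot, two digits (e.g. "1.23").
old_pattern = re.compile(r'^\d{1}(\.)\d{2}$')

# Pattern for "exactly the single character 1 or 2".
new_pattern = re.compile(r'[12]')


def is_one_or_two(value):
    """Hypothetical helper: True if str(value) is exactly '1' or '2'."""
    return new_pattern.fullmatch(str(value)) is not None


# Build a boolean mask column by column, then reuse where() as above.
mask = my_df.apply(lambda col: col.map(is_one_or_two)).astype(bool)
cleaned = my_df.where(mask)
print(cleaned)
```

Unlike `isin([1, 2])`, this also keeps the strings '1' and '2', because every value goes through `str()` before matching.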
