I want to use re.match() to clean a pandas DataFrame so that an entry in any column that is 1 or 2 remains unchanged, and any other value is set to NaN.

The problem is that my function sets everything to NaN. I'm new to regular expressions, so I think I've made a mistake in the pattern.

Thanks!

# DATA
import re
import numpy as np
import pandas as pd

data = [['Bob',10,1],['Bob',2,2],['Clarke',13,1]]
my_df = pd.DataFrame(data,columns=['Name','Age','Sex'])

print(my_df)
     Name  Age  Sex
0     Bob   10    1
1     Bob    2    2
2  Clarke   13    1


# CLEANING FUNCTION
def my_fun(df):
    for col in df.columns:
        for row in df.index:
            if re.match('^\d{1}(\.)\d{2}$', str(df[col][row])):
                df[col][row] = df[col][row]
            else:
                df[col][row] = np.nan
    return df


# OUTPUT
my_fun(my_df)

   Name  Age  Sex
0   NaN  NaN  NaN
1   NaN  NaN  NaN
2   NaN  NaN  NaN


# EXPECTED/DESIRED OUTPUT 

   Name  Age  Sex
0   NaN  NaN    1
1   NaN    2    2
2   NaN  NaN    1
Robbie
  • Why do you need regex instead of using `my_df[my_df.isin([1,2])]`? – Chris Aug 30 '20 at 12:13
  • Not directly answering your question, but I believe this will be easier, if replacement is your main aim. `my_df.replace([1, 2], np.nan)` This will return a replaced dataframe. The `replace` method also has an `inplace` parameter. P.S.: I notice that this seems to convert otherwise ints to floats, so watch out for that. – navneethc Aug 30 '20 at 12:19
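
The two suggestions in the comments do not behave the same way, which is easy to check on the sample frame. The following is a minimal sketch, assuming the `my_df` from the question (the names `kept` and `replaced` are only illustrative): boolean indexing with `isin` keeps 1 and 2 and blanks everything else, whereas `replace([1, 2], np.nan)` does the reverse and blanks the 1s and 2s themselves.

```python
import numpy as np
import pandas as pd

data = [['Bob', 10, 1], ['Bob', 2, 2], ['Clarke', 13, 1]]
my_df = pd.DataFrame(data, columns=['Name', 'Age', 'Sex'])

# Keep only entries equal to 1 or 2; every other entry becomes NaN.
kept = my_df[my_df.isin([1, 2])]

# The opposite: entries equal to 1 or 2 become NaN, the rest are kept.
replaced = my_df.replace([1, 2], np.nan)

print(kept)
print(replaced)
```

Only the first form matches the desired output in the question. Both calls return a new frame and leave `my_df` unchanged.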

1 Answer

You can use where together with isin here for an exact match:

my_df.where(my_df.isin([1,2]))

  Name  Age  Sex
0  NaN  NaN    1
1  NaN  2.0    2
2  NaN  NaN    1

Some observations:

  • df[col][row] is not a recommended way to index a DataFrame in pandas. Use .loc or .iloc instead; see Indexing and selecting data.

  • Looping over a DataFrame is generally discouraged, as the result tends to perform very poorly. I'd suggest reading How to iterate over rows in a DataFrame in Pandas.

  • You don't need a regex for what you want to do. You want to match either 1 or 2, and there are more straightforward ways of doing this, both with Python lists and with pandas. Reach for regex only when matching with built-in methods gets complicated.
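
As for why the original function blanked everything: the pattern `^\d{1}(\.)\d{2}$` requires one digit, a literal dot, and then two digits (for example `1.23`), so neither `'1'` nor `'2'` can match. If a regex is still wanted, something along the following lines should work. This is only a sketch: the pattern `^[12]$` and the helper name `keep_one_or_two` are illustrative choices, not part of the original post.

```python
import re

import numpy as np
import pandas as pd

data = [['Bob', 10, 1], ['Bob', 2, 2], ['Clarke', 13, 1]]
my_df = pd.DataFrame(data, columns=['Name', 'Age', 'Sex'])

# The original pattern only accepts the shape "d.dd", so "1" and "2" never match.
original_pattern = r'^\d{1}(\.)\d{2}$'
print(re.match(original_pattern, '1'))     # no match
print(re.match(original_pattern, '1.23'))  # match


def keep_one_or_two(df):
    """Return a copy of df in which every entry other than 1 or 2 is NaN.

    Hypothetical helper written for this example.
    """
    # Build a boolean mask column by column: convert each value to a string
    # and test it against a pattern that accepts exactly "1" or "2".
    mask = df.apply(lambda col: col.astype(str).str.match(r'^[12]$'))
    # where() keeps entries where the mask is True and sets the rest to NaN.
    return df.where(mask)


cleaned = keep_one_or_two(my_df)
print(cleaned)
```

Because the values are compared as strings, a float such as 1.0 becomes `'1.0'` and would not match this pattern, which is one more reason to prefer `isin` for numeric data.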

yatu