Is there a way to automate data cleaning for pandas DataFrames?

Question

I am cleaning my data for a machine learning project by replacing missing values in the 'Age' column with zeros and missing values in the 'Fare' column with the column mean. The code is given below:

train_data['Age'] = train_data['Age'].fillna(0) 
mean = train_data['Fare'].mean()    
train_data['Fare'] = train_data['Fare'].fillna(mean)

Since I will have to do this multiple times for other data sets, I want to automate the process by creating a generic function that takes a DataFrame as input, modifies it, and returns the modified DataFrame. The code for that is given below:

def data_cleaning(df):
    df['Age'] = df['Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df['Fare'].fillna()
    return df

However, when I pass the training DataFrame:

train_data = data_cleaning(train_data)

I get the following error:

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: 
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_42/1440633985.py in <module>
      1 #print(train_data)
----> 2 train_data = data_cleaning(train_data)
      3 cross_val_data = data_cleaning(cross_val_data)

/tmp/ipykernel_42/3053068338.py in data_cleaning(df)
      2     df['Age'] = df['Age'].fillna(0)
      3     fare_mean = df['Fare'].mean()
----> 4     df['Fare'] = df['Fare'].fillna()
      5     return df

/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in fillna(self, value, method, axis, inplace, limit, downcast)
   4820             inplace=inplace,
   4821             limit=limit,
-> 4822             downcast=downcast,
   4823         )
   4824 

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6311         """
   6312         inplace = validate_bool_kwarg(inplace, "inplace")
-> 6313         value, method = validate_fillna_kwargs(value, method)
   6314 
   6315         self._consolidate_inplace()

/opt/conda/lib/python3.7/site-packages/pandas/util/_validators.py in validate_fillna_kwargs(value, method, validate_scalar_dict_value)
    368 
    369     if value is None and method is None:
--> 370         raise ValueError("Must specify a fill 'value' or 'method'.")
    371     elif value is None and method is not None:
    372         method = clean_fill_method(method)

ValueError: Must specify a fill 'value' or 'method'.

After some research, I found that I might have to use the apply() and map() functions instead, but I am not sure how to pass in the mean value of the column. Furthermore, this does not scale well, as I would have to calculate all the fillna values before passing them into the function, which is cumbersome. Therefore I want to ask: is there a better way to automate data cleaning?

Does this answer your question? [How to deal with SettingWithCopyWarning in Pandas](https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas) — ddejohn, Oct 11 '21 at 15:52

Dhana D. · Answer 1 · 2021-10-11T15:37:22.000

0

In your function, the line df['Fare'] = df['Fare'].fillna() does not pass any fill value to fillna(), so it raises the ValueError shown in the traceback. Change it to df['Fare'] = df['Fare'].fillna(fare_mean).

If you intend to use this function from another file in the same directory, you can import it there with:

from file_that_contain_function import function_name

And if you intend to make it reusable across your workspace or virtual environment, you may need to create your own Python package.
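As for the scaling concern in the question, one option is to let the function work out the fill values itself, so nothing has to be computed beforehand. The following is a minimal sketch rather than anything from the question or the answers: `fill_missing`, its `strategies` argument, and the sample data are hypothetical names introduced for illustration. It assumes each column should be filled either with a constant or with that column's own mean or median.

```python
import pandas as pd


def fill_missing(df, strategies):
    """Return a copy of df with NaNs filled column by column.

    strategies maps a column name to either a constant fill value or
    one of the strings "mean" / "median", computed from that column.
    (Hypothetical helper, not part of pandas.)
    """
    cleaned = df.copy()  # leave the caller's DataFrame untouched
    for column, strategy in strategies.items():
        if isinstance(strategy, str) and strategy == "mean":
            fill_value = cleaned[column].mean()
        elif isinstance(strategy, str) and strategy == "median":
            fill_value = cleaned[column].median()
        else:
            fill_value = strategy  # anything else is used as a constant
        cleaned[column] = cleaned[column].fillna(fill_value)
    return cleaned


# Hypothetical sample data standing in for train_data
sample = pd.DataFrame({"Age": [22.0, None, 30.0], "Fare": [10.0, 20.0, None]})
cleaned_sample = fill_missing(sample, {"Age": 0, "Fare": "mean"})
```

The same mapping can then be reused for every data set, for example `cross_val_data = fill_missing(cross_val_data, {"Age": 0, "Fare": "mean"})`. Note that each call computes the mean from the DataFrame it is given; whether that is appropriate for validation data is a modelling decision this sketch does not address.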

edited Oct 11 '21 at 15:37

answered Oct 11 '21 at 15:30

Dhana D.


It's shown in the error message you put in the question, are you sure that it wasn't the problem/error? – Dhana D. Oct 11 '21 at 15:34
sorry, I hadn't noticed it – Tejas Rao Oct 11 '21 at 15:35
However, I still get the warning – Tejas Rao Oct 11 '21 at 15:36
Can you please show the warning? – Dhana D. Oct 11 '21 at 15:38
It's the same warning that is shown in the first part of the error message – Tejas Rao Oct 11 '21 at 15:43
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy – Tejas Rao Oct 11 '21 at 15:46
You can simply ignore that or turn the warning off: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas – Dhana D. Oct 11 '21 at 15:47
I recommend *not* ignoring/turning off warnings. Warnings are there for a reason... – ddejohn Oct 11 '21 at 16:34

score 0 · Answer 2 · answered Oct 11 '21 at 15:49


So yes, the other answer explains where the error is coming from.

However, the warning at the beginning has nothing to do with filling NaNs. The warning is telling you that you may be assigning to a copy of a slice of another DataFrame, rather than to the original. Change your code to:

def data_cleaning(df):
    df['Age'] = df.loc[:, 'Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df.loc[:, 'Fare'].fillna(fare_mean)  # <- and also fix this error
    return df

I also suggest searching for that specific warning on this site, as there are hundreds of posts detailing this warning and how to deal with it. Here's a good one.
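If the warning persists after this change, one plausible explanation (an assumption, since the question does not show how train_data was created) is that train_data is itself a slice of a larger DataFrame, so any column assignment on it triggers the warning regardless of how the right-hand side is written. A minimal sketch under that assumption is to have the function work on an explicit copy; `full_data` and the filter used to build `train_data` are hypothetical stand-ins.

```python
import pandas as pd


def data_cleaning(df):
    df = df.copy()  # independent copy, so assignments never hit a slice
    df["Age"] = df["Age"].fillna(0)
    df["Fare"] = df["Fare"].fillna(df["Fare"].mean())
    return df


# Hypothetical data: full_data stands in for whatever train_data was cut from
full_data = pd.DataFrame(
    {"Age": [22.0, None, 30.0, 41.0], "Fare": [10.0, 20.0, None, 5.0]}
)
train_data = full_data[full_data["Fare"] != 5.0]  # a slice of full_data

train_data = data_cleaning(train_data)
```

An alternative with the same effect is to call `.copy()` at the point where the slice is taken, for example `train_data = full_data[full_data["Fare"] != 5.0].copy()`, and leave the function unchanged.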

answered Oct 11 '21 at 15:49

ddejohn


Thanks for this answer, however, I'm still getting the warning: /opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead – Tejas Rao Oct 12 '21 at 04:05
However, I have got some understanding of the issue, and will try to resolve it. – Tejas Rao Oct 12 '21 at 04:07
