I am learning how to handle missing values in a dataset. I have a table with ~1 million entries, and I'm trying to deal with a small number of missing values.
My data concerns a bicycle-share system and my missing values are start & end locations.
Data: missing starting stations, only 7 values
Data: missing ending station, 24 values altogether
I want to fill the NaN in both cases with the mode of the "opposite" station. For example, for start_station == 21, I want to see what the most common end_station is, and use that to fill in my missing value.
E.g. df.loc[df['start_station'] == 21].end_station.mode()
I tried to achieve this with a function:
    def inpute_end_station(df):
        for index, row in df.iterrows():
            if pd.isnull(df.loc[index, 'end_station']):
                start_st = df.loc[index, 'start_station']
                mode = df.loc[df['start_station'] == start_st].end_station.mode()
                df.loc[index, 'end_station'].fillna(mode, inplace=True)
The last line throws an AttributeError: 'numpy.float64' object has no attribute 'fillna'. If instead I just use df.loc[index, 'end_station'] = mode, I get ValueError: Incompatible indexer with Series.
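A minimal sketch of where the two errors seem to come from, using a toy frame with made-up values (not the real data): selecting one row label and one column label with df.loc gives a single scalar, which has no fillna method, while mode() returns a Series (there can be ties), which cannot be assigned to a single cell as-is.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing end_station.
df = pd.DataFrame({
    "start_station": [21, 21, 21],
    "end_station": [5.0, 5.0, np.nan],
})

# .mode() returns a Series, because several values can tie for most common.
mode = df.loc[df["start_station"] == 21, "end_station"].mode()
print(type(mode).__name__)       # Series

# One row label plus one column label selects a single scalar cell,
# and a scalar has no .fillna method.
cell = df.loc[2, "end_station"]
print(hasattr(cell, "fillna"))   # False

# Taking the first mode yields a scalar that can be assigned to one cell.
df.loc[2, "end_station"] = mode.iloc[0]
```

Picking mode.iloc[0] is an assumption about how ties should be handled: it takes the smallest of the tied values.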
Am I approaching this properly? I understand it's bad practice to modify something you're iterating over in pandas, so what's the correct way of changing the start_station and end_station columns and replacing the NaNs with the corresponding mode of the complementary station?
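One way this could be done without iterating, offered as a sketch under assumptions rather than a definitive answer: compute the most common "opposite" station per group with groupby, map it back onto each row, and use it only where the value is missing. The helper name fill_with_mode_of_opposite and the toy data are hypothetical, and ties are broken by taking the first (smallest) mode.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real table (values are made up).
df = pd.DataFrame({
    "start_station": [21, 21, 21, 21, 30, 30, 30, np.nan],
    "end_station":   [5, 5, 7, np.nan, 9, 9, np.nan, 9],
})

def fill_with_mode_of_opposite(df, target, key):
    """Return `target` with NaNs replaced by the most common `target`
    value among rows that share the same `key` value."""
    # Use only rows where both stations are known to compute the modes.
    known = df.dropna(subset=[key, target])
    # Most common target per key; on ties, .mode() is sorted, so take the first.
    modes = known.groupby(key)[target].agg(lambda s: s.mode().iloc[0])
    # Look up each row's key in the modes and fill only the missing targets.
    # Rows whose key has no known mode stay NaN.
    return df[target].fillna(df[key].map(modes))

# Compute both fills from the original data so that imputed values
# in one column do not influence the modes used for the other.
filled_end = fill_with_mode_of_opposite(df, "end_station", "start_station")
filled_start = fill_with_mode_of_opposite(df, "start_station", "end_station")
df = df.assign(start_station=filled_start, end_station=filled_end)
print(df)
```

With only a handful of missing values out of ~1 million rows, the loop would also work once the scalar assignment is fixed; the grouped version mainly avoids recomputing the mode row by row.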