1

I am learning how to handle missing values in a dataset. I have a table with ~1million entries. I'm trying to deal with a small number of missing values.

My data concerns a bicycle-share system and my missing values are start & end locations.

Data: missing starting stations, only 7 values

enter image description here

Data: missing ending station, 24 values altogether

enter image description here

I want to fill the NaN in both cases with the mode of the "opposite" station. Example, for start_station==21, I want to see what is the most common end_station, and use that to fill in my missing value. E.g. df.loc[df['start_station'] == 21].end_station.mode()

I tried to achieve this with a function:

def inpute_end_station(df):
    for index, row in df.iterrows():    
        if pd.isnull(df.loc[index, 'end_station']):

            start_st = df.loc[index, 'start_station']
            mode = df.loc[df['start_station'] == start_st].end_station.mode()
            df.loc[index, 'end_station'].fillna(mode, inplace=True)

The last line throws a AttributeError: 'numpy.float64' object has no attribute 'fillna'. If instead I just use df.loc[index, 'end_station'] = mode I get ValueError: Incompatible indexer with Series.

Am I approaching this properly? I understand it's bad practice to modify something you're iterating over in pandas so what's the correct way of changing start_station and end_station columns and replacing the NaNs with the corresponding mode of the complimentary station?

Krantz
  • 1,424
  • 1
  • 12
  • 31
Bn.F76
  • 783
  • 2
  • 12
  • 30
  • Seems similar to this question asked yesterday: https://stackoverflow.com/questions/55562696/how-to-replace-missing-values-with-group-mode-in-pandas. Should be able to use that same function (just with different column names), then map the result. – ALollz Apr 10 '19 at 01:06
  • @ALollz the solution there does work with the helper function but I get a `SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead`. Is this really the best way? – Bn.F76 Apr 10 '19 at 01:17
  • That error is likely unrelated to this, and stems from a prior slicing operation that generated a copy. You can read this for some more background in https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas, but the TLDR band-aid fix is that you should probably just do `df = df.copy()` prior to whichever line of code threw that warning. In reality, you should fix the slice operation somewhere else, which probably requires a `.copy()` after some masking step, or to re-write code without chained assignment. – ALollz Apr 10 '19 at 01:20

1 Answers1

1

In my opinion, when you want to iterate over a column in pandas like this, the best practice is using apply() function.

For this particular case, I would suggest the following approach, that is shown below on my sample data. I do not have much experience using mode() method, so I used value_counts() method in combination with first_valid_index() method in order to determine mode value.

# import pandas
import pandas as pd

# make a sample data
list_of_rows = [
  {'start_station': 1, 'end_station': 1},
  {'start_station': None, 'end_station': 1},
  {'start_station': 1, 'end_station': 2},
  {'start_station': 1, 'end_station': 3},
  {'start_station': 2, 'end_station': None},
  {'start_station': 2, 'end_station': 3},
  {'start_station': 2, 'end_station': 3},
]

# make a pandas data frame
df = pd.DataFrame(list_of_rows)

# define a function
def fill_NaNs_in_end_station(row):
    if pd.isnull(row['end_station']):
        start_station = row['start_station']
        return df[df['start_station']==start_station].end_station.value_counts().first_valid_index()
    return row['end_station']

# apply function to dataframe
df['end_station'] = df.apply(lambda row: fill_NaNs_in_end_station(row), axis=1)
Jaroslav Bezděk
  • 6,967
  • 6
  • 29
  • 46