
I have an interesting problem, which I have fixed at a surface level, but I would like to improve my implementation.

I have a DataFrame that holds a dataset for later machine learning. It has ~500 feature columns and 4 target columns. The targets are related to each other with increasing granularity (e.g. fault/no-fault, fault-where, fault-group, fault-exact). The DataFrame has quite a lot of NaN values, since it was compiled from 2 separate datasets via an OUTER join: some rows are complete, while others have data from one dataset but not the other.


Anyway, scikit-learn's SimpleImputer() transformer did not give me the ML results I was after, so I figured I should do imputation based on the targets: compute a median value from the samples available per target class in each column, and impute those. Then check whether any NaN values are left, and if so, move to tar_3 (one level of granularity down), compute medians there as well, and impute those values per target, per column. And so on, until no NaNs are left.

I have implemented that with the code below, which I fully understand is clunky and takes forever to execute:

tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']

for tar in tar_list:

    medians = df.groupby(by=tar).agg('median')
    uniques = sorted(df[tar].unique())
    print("\nFilling values based on {} column granularity.".format(tar))

    for col in [col for col in df.columns if col not in tar_list]:

        print(col)

        for class_name in uniques:

            value_to_fill = medians.loc[class_name][col]
            print("Setting NaNs for target {} in column {} to {}".format(class_name, col, value_to_fill))
            df.loc[df[tar] == class_name, col] = df.loc[df[tar] == class_name, col].fillna(value=value_to_fill)
    print()

While I am happy with the result this code produces, it has 2 drawbacks which I cannot ignore: 1) It takes forever to execute, even on my small ~1000 samples x ~500 columns dataset. 2) It imputes the same median value for all NaNs in each column per target value it is currently working on. I would prefer it to impute values with a bit of noise, to avoid simple repetition of the data (maybe a value randomly drawn from a normal distribution fitted to the values in that column for that target?).
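To illustrate the second point, here is a minimal sketch of what I mean (the `fill_with_noise` helper is hypothetical, not part of my actual code): each imputed value is drawn from a normal distribution centred on the group's median, with the group's standard deviation as spread:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def fill_with_noise(s):
    # Replace NaNs in one group's column with random draws centred on
    # the group's median, using the group's standard deviation as spread.
    out = s.copy()
    mask = out.isna()
    if mask.any() and out.notna().sum() > 1:
        out[mask] = rng.normal(out.median(), out.std(), mask.sum())
    return out

# tiny toy frame standing in for one target level and one feature column
toy = pd.DataFrame({"tar_4": [0, 0, 0, 1, 1, 1],
                    "col_0": [1.0, np.nan, 3.0, 10.0, np.nan, 12.0]})
toy["col_0"] = toy.groupby("tar_4")["col_0"].transform(fill_with_noise)
```

Non-NaN values are untouched; only the gaps get (noisy) imputed values.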

As far as I am aware, there are no out-of-the-box tools in scikit-learn or pandas to achieve this task in a more efficient way. However, if there are, can someone point me in the right direction? Alternatively, I am open to suggestions on how to enhance this code to address both of my concerns.
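For reference, the per-group median fill for a single target level can at least be written as one vectorized step (a sketch on a toy frame; in my real data `cols` would be the ~500 feature column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"tar_4": [0, 0, 1, 1],
                   "col_0": [1.0, np.nan, 5.0, np.nan]})
cols = ["col_0"]

# transform('median') broadcasts each group's median back to every row,
# so fillna can consume it as a same-shaped frame
df[cols] = df[cols].fillna(df.groupby("tar_4")[cols].transform("median"))
```

Whether this also generalizes cleanly to the cascade over tar_4 → tar_1 is exactly what I am asking about.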

UPDATE:

Code generating a sample DataFrame like the one I mentioned:

import numpy as np
import pandas as pd

vsize = 1000  # vsize was not defined above; any sample count will do

df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                  columns=["col_{}".format(x) for x in range(10)],
                  index=range(0, vsize * 3, 3))

df_2 = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                    columns=["col_{}".format(x) for x in range(10, 20, 1)],
                    index=range(0, vsize * 2, 2))

df = df.merge(df_2, left_index=True, right_index=True, how='outer')

df_tar = pd.DataFrame({"tar_1": [np.random.randint(0, 2) for x in range(vsize * 3)],
                       "tar_2": [np.random.randint(0, 4) for x in range(vsize * 3)],
                       "tar_3": [np.random.randint(0, 8) for x in range(vsize * 3)],
                       "tar_4": [np.random.randint(0, 16) for x in range(vsize * 3)]})

df = df.merge(df_tar, left_index=True, right_index=True, how='inner')
  • Try to make a [reproducible](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) pandas question. People will most likely be able to answer quicker for two reasons: 1. they can copy and paste your example dataframe (we can't copy pictures); 2. if you include an _expected output_ we can visually see what's going on instead of reading through lots of text and code. – Erfan May 15 '19 at 14:44
  • You are right, I have updated my question to include code generating sample DataFrame like the one I mentioned. – Greem666 May 15 '19 at 15:18

1 Answer


Try this:

tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']
cols = [col for col in df.columns if col not in tar_list]
# save the original index, since your dataframe may not have a continuous one
idx = df.index

for tar in tar_list:
    medians = df[cols].groupby(by=df[tar]).agg('median')
    # with tar as the index, fillna below aligns each row with its target's median
    df.set_index(tar, inplace=True)
    for col in cols:
        df[col] = df[col].fillna(medians[col])
    df.reset_index(inplace=True)

df.index = idx
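The key is that `fillna` with a Series value aligns on the index, so after `set_index(tar)` every NaN row picks up the median computed for its own target value. A minimal illustration (toy labels, not your data):

```python
import numpy as np
import pandas as pd

# index plays the role of the target column after set_index(tar)
s = pd.Series([1.0, np.nan, np.nan], index=['a', 'b', 'a'])
medians = pd.Series({'a': 10.0, 'b': 20.0})

# each NaN is filled with the value for its own index label
filled = s.fillna(medians)
```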

Took about 1.5s with the sample data:

import numpy as np
import pandas as pd

np.random.seed(2019)
len_df = 1000
num_cols = 500
df = pd.DataFrame(np.random.choice(list(range(10)) + [np.nan],
                                   size=(len_df, num_cols),
                                   p=[0.05] * 10 + [0.5]),
                  columns=[str(x) for x in range(num_cols)])

for i in range(1, 5):
    np.random.seed(i)
    df[f'tar_{i}'] = np.random.randint(i * 4, (i + 1) * 4, len_df)
  • That worked beautifully mate. Great piece of code! A question though - at the moment I am just setting medians values to NaNs in each col, respective to the 'tar' value. Is there an easy way, in which I could modify this code to introduce some noise to these values? E.g. if there are 3x NaN values for column X holding 10 values in total, instead of replacing these 3 NaNs with median of the 7 values, I would like each of these 3 NaNs to be replaced with a different value +/- 0-3 x standard deviation from the median value of the 7 values in this column. – Greem666 May 16 '19 at 00:32
  • @Greem666 I'm not sure that I totally understand your question. But it looks like you can compute the ratios of `NaN` for each column, modify `medians` for those columns satisfying the ratio conditions, then assign. – Quang Hoang May 16 '19 at 13:07
  • An example of what I have in mind is as follows: col_1 is getting its NaNs filled with a median of non-NaN values in that column. There are 3 non-NaN values, and 2 NaN values. The Median of non-NaN columns = 5, and standard error = 1. First NaN in that column gets assigned 5 + (3 * 1), the second NaN gets 5 - (2* 1). This way, both NaNs are filled with values belonging to the distribution of non-NaN values in that column. At the moment, the script fills all NaNs in col_1 with the same value (median in this case). – Greem666 May 16 '19 at 23:50
  • I think I understand the problem now, but I also believe it’s worthy of a new question. – Quang Hoang May 16 '19 at 23:53
  • Ok, I'll ask a new question then. – Greem666 May 17 '19 at 00:29
  • Here's my new question: https://stackoverflow.com/questions/56178297/pandas-variable-fillna-in-each-column – Greem666 May 17 '19 at 01:02