I have an interesting problem, which I have solved at a surface level, but I would like to improve my implementation.
I have a DataFrame which holds a dataset for later Machine Learning. It has feature columns (~500 of them) and 4 target columns. The targets are related to each other with increasing granularity (e.g. fault/no_fault, fault-where, fault-group, fault-exact). The DataFrame has quite a lot of NaN values, since it was compiled from 2 separate datasets via an OUTER join - some rows are complete, others have data from one dataset but not the other, and so on - see the code generating a comparable sample DataFrame in the update below.
Anyway, scikit-learn's SimpleImputer() transformer did not give me the ML results I was after, so I figured I should impute based on the targets instead: start with the most granular target (tar_4), compute the median of the available samples per target class in each column, and impute those. Then check whether any NaN values are left and, if so, move to tar_3 (one granularity level coarser), compute medians there, and impute them per target class, per column. And so on, until no NaNs are left.
I have implemented that with the code below, which I fully understand is clunky and takes forever to execute:
tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']  # most granular target first

for tar in tar_list:
    # Median of every column per class of the current target
    medians = df.groupby(by=tar).agg('median')
    print("\nFilling values based on {} column granularity.".format(tar))

    # Only impute the feature columns, never the targets themselves
    for col in [col for col in df.columns if col not in tar_list]:
        print(col)
        uniques = sorted(df[tar].unique())
        for class_name in uniques:
            value_to_fill = medians.loc[class_name][col]
            print("Setting NaNs for target {} in column {} to {}".format(class_name, col, value_to_fill))
            # Fill only the rows belonging to this target class
            df.loc[df[tar] == class_name, col] = df.loc[df[tar] == class_name, col].fillna(value=value_to_fill)
        print()
While I am happy with the result this code produces, it has 2 drawbacks which I cannot ignore:
1) It takes forever to execute, even on my small ~1000 samples x ~500 columns dataset.
2) It imputes the same median value into every NaN of a column for the target class it is currently working on. I would prefer to impute values with a bit of noise, to avoid simply repeating the same number (perhaps values drawn from a normal distribution fitted to that column's available values for that target class?). A rough sketch of what I mean is below.
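For illustration only, this is roughly the kind of noisy imputation I have in mind; the helper name noisy_group_impute and the plain normal fit are placeholders for the idea, not a tested solution:

import numpy as np

def noisy_group_impute(df, target, feature_cols, rng=None):
    # Sketch: per target class, fill NaNs in each feature column with draws
    # from a normal distribution fitted to that class's observed values.
    if rng is None:
        rng = np.random.default_rng()
    out = df.copy()
    for col in feature_cols:
        for class_name, group in df.groupby(target):
            observed = group[col].dropna()
            missing = group[col].isna().to_numpy()
            if missing.any() and len(observed) > 1:
                draws = rng.normal(observed.mean(), observed.std(), size=missing.sum())
                out.loc[group.index[missing], col] = draws
    return out

# e.g. df = noisy_group_impute(df, 'tar_4', [c for c in df.columns if c not in tar_list])

Whether a normal fit is even appropriate for these features is exactly the kind of feedback I am after.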
As far as I am aware, there are no out-of-the-box tools in scikit-learn or pandas to achieve this task more efficiently. However, if there are - can someone point me in the right direction? Alternatively, I am open to suggestions on how to enhance this code to address both of my concerns; one direction I have been wondering about is sketched below.
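For example, I suspect the inner loops could be replaced with a grouped transform, something like the sketch below (using the same df and tar_list as above), though I have not verified that it behaves identically to my loop version:

feature_cols = [c for c in df.columns if c not in tar_list]
for tar in tar_list:
    # Per-row median of each feature column within that row's target class
    group_medians = df.groupby(tar)[feature_cols].transform('median')
    # Fill NaNs from the aligned medians; anything still NaN falls through to the next, coarser target
    df[feature_cols] = df[feature_cols].fillna(group_medians)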
UPDATE:
Code generating the sample DataFrame I mentioned:
import numpy as np
import pandas as pd

vsize = 50  # number of base rows; any small value works for this example

# Two feature blocks with partially overlapping indices -> the OUTER join leaves NaNs
df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                  columns=["col_{}".format(x) for x in range(10)],
                  index=range(0, vsize * 3, 3))
df_2 = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                    columns=["col_{}".format(x) for x in range(10, 20, 1)],
                    index=range(0, vsize * 2, 2))
df = df.merge(df_2, left_index=True, right_index=True, how='outer')

# Four target columns of increasing granularity (2, 4, 8 and 16 classes)
df_tar = pd.DataFrame({"tar_1": [np.random.randint(0, 2) for x in range(vsize * 3)],
                       "tar_2": [np.random.randint(0, 4) for x in range(vsize * 3)],
                       "tar_3": [np.random.randint(0, 8) for x in range(vsize * 3)],
                       "tar_4": [np.random.randint(0, 16) for x in range(vsize * 3)]})
df = df.merge(df_tar, left_index=True, right_index=True, how='inner')
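A quick sanity check (not part of the actual pipeline) that the sample data reproduces the block-wise missingness I described:

# Rows present in only one of the two feature blocks should show NaNs in the other block
print(df.isna().sum())
print(df.head(10))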