I have two DataFrames, train and test. The test set has missing values in one column.

import numpy as np
import pandas as pd

train = [[0,1],[0,2],[0,3],[0,7],[0,7],[1,3],[1,5],[1,2],[1,2]]
test = [[0,0],[0,np.nan],[1,0],[1,np.nan]]

train = pd.DataFrame(train, columns = ['A','B'])
test = pd.DataFrame(test, columns = ['A','B'])

The test set has two missing values in column B. They should be filled with a per-group statistic of B computed from train, grouping by column A:

  • If the imputing strategy is mode, then the missing values should be imputed with 7 (for A = 0) and 2 (for A = 1).
  • If the imputing strategy is mean, then the missing values should be (1+2+3+7+7)/5 = 4 (for A = 0) and (3+5+2+2)/4 = 3 (for A = 1).
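The expected fill values above can be checked directly on train. This is only a small sketch: the names group_mean and group_mode are introduced here for illustration, and the mode is taken as the first value returned by Series.mode(), which is an assumption about how ties should be broken.

```python
import pandas as pd

train = pd.DataFrame(
    [[0, 1], [0, 2], [0, 3], [0, 7], [0, 7], [1, 3], [1, 5], [1, 2], [1, 2]],
    columns=['A', 'B'],
)

# Per-group mean of B, indexed by the values of A.
group_mean = train.groupby('A')['B'].mean()

# Per-group mode of B; Series.mode() returns all modes in sorted order,
# so .iloc[0] picks the smallest one when there is a tie (assumption).
group_mode = train.groupby('A')['B'].agg(lambda s: s.mode().iloc[0])

print(group_mean.to_dict())
print(group_mode.to_dict())
```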

What is a good way to do this?

This question is related, but uses only one dataframe instead of two.

ThePortakal

1 Answer

IIUC, here's one way:

from statistics import mode

# Index test by A so that fillna aligns each row with its group's statistic from train.
test_mode = test.set_index('A').fillna(train.groupby('A').agg(mode)).reset_index()
test_mean = test.set_index('A').fillna(train.groupby('A').mean()).reset_index()

If you want a function:

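from statistics import mode

def evaluate_nan(train, test, strategy='mean'):
    # Per-group statistic of B computed from train, indexed by A.
    fill_values = train.groupby('A').agg(strategy)
    # Index test by A so fillna picks the matching group's value.
    return test.set_index('A').fillna(fill_values).reset_index()

test_mean = evaluate_nan(train, test)
test_mode = evaluate_nan(train, test, strategy=mode)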
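
As a possible alternative (a sketch, not part of the approach above), the group statistic can be looked up with Series.map, which keeps test's original row index and column order and avoids the set_index/reset_index round trip. The helper name impute_by_group and its key, col and strategy parameters are made up for this example, and 'mode' is handled by taking the first value of Series.mode().

```python
import numpy as np
import pandas as pd

train = pd.DataFrame(
    [[0, 1], [0, 2], [0, 3], [0, 7], [0, 7], [1, 3], [1, 5], [1, 2], [1, 2]],
    columns=['A', 'B'],
)
test = pd.DataFrame([[0, 0], [0, np.nan], [1, 0], [1, np.nan]], columns=['A', 'B'])

def impute_by_group(train, test, key='A', col='B', strategy='mean'):
    """Hypothetical helper: fill NaNs in test[col] with a per-group statistic from train."""
    grouped = train.groupby(key)[col]
    if strategy == 'mode':
        # Series.mode() can return several values; take the first (smallest) one.
        stats = grouped.agg(lambda s: s.mode().iloc[0])
    else:
        # Any aggregation name accepted by GroupBy.agg, e.g. 'mean' or 'median'.
        stats = grouped.agg(strategy)
    out = test.copy()
    # Map each row's key to its group statistic, then fill only the missing entries.
    out[col] = out[col].fillna(out[key].map(stats))
    return out

test_mean = impute_by_group(train, test, strategy='mean')
test_mode = impute_by_group(train, test, strategy='mode')
```

Groups that appear in test but not in train map to NaN, so their missing values stay missing rather than raising an error.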
Nk03