I currently have a large dataset with quite a few missing values.

I'm trying to fill in these missing values by building a distribution from the data I already have and sampling from it. E.g. build the distribution, randomly choose a number from 0 to 1, and fill in the missing entry with the corresponding value from that distribution.

I've read the documentation for scipy and numpy. I think I'm looking for a continuous version of random.choice.

Company  Weight
a        30
a        45
a        27
a        na
a        57
a        57
a        na

I'm trying to fill the NA values by creating a continuous distribution from the data I already have.

I've tried using np.random.choice so far, i.e.: np.random.choice([30, 45, 27, 57], p=[0.2, 0.2, 0.2, 0.4])

However, this only returns the specific arguments I pass in. I'm trying to create a continuous model so that I can get back any number between 27 and 57, with probability based on how many times a given value appears in my existing data.

So in this case, numbers closer to 57 would be more likely to be chosen, since 57 appears more frequently in my existing data.
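
Roughly, the kind of thing I have in mind looks like this (just a sketch using np.quantile as one possible continuous mapping; I'm not sure this is the right approach):

import numpy as np

# observed (non-missing) weights from the example above
observed = np.array([30, 45, 27, 57, 57], dtype=float)

# draw one uniform number per missing value and map it through the
# empirical quantile function; linear interpolation keeps the result continuous
u = np.random.rand(2)
filled = np.quantile(observed, u)  # values land between 27 and 57
print(filled)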

  • If possible, can you please provide some sample data and an example to help us solve your issue? What have you tried so far? If you are trying to fill in missing "NaN" values, I suggest you look into [.fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html). From there you can use an anonymous function and fill the values as you please, using something like random.choice. – A.J. Nock May 25 '21 at 03:40
  • The first step is to identify the proper statistical distribution for your data which by itself is an incredibly challenging part of your problem. If you randomly choose a number from 0 to 1, then you're assuming your data follows a uniform distribution which may or may not be true. See [this post](https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python?lq=1) for an example of fitting data to a theoretical distribution. – Joe Mercurio May 25 '21 at 04:02
  • Hi! Thank you for the comments! I have added an example for clarity! – ChrisHo1341 May 25 '21 at 04:49

1 Answer

Kernel density estimation (KDE) is a common way to generate a continuous distribution from sample data, though it generally requires tuning a few parameters (here, the kernel and bandwidth). Other imputation methods include mean/mode imputation (basic) and model-based prediction (more sophisticated).
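
For reference, the basic mean-imputation route mentioned above could be sketched like this (a minimal baseline, not the method used below):

import pandas as pd
from numpy import nan

df = pd.DataFrame({"company": ["A"] * 7,
                   "weight": [30.0, 45.0, 27.0, nan, 57.0, 57.0, nan]})

# simplest baseline: fill missing weights with the column mean
df["weight"] = df["weight"].fillna(df["weight"].mean())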

Below we fit a kernel density estimator to the observed weights, then draw random samples from the density with kde.sample to fill the nan values:

import pandas as pd
import numpy as np
from numpy import nan
from sklearn.neighbors import KernelDensity

BANDWIDTH = 1
KERNEL = "gaussian"

data = {'company': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'A', 6: 'A'},
'weight': {0: 30.0, 1: 45.0, 2: 27.0, 3: nan, 4: 57.0, 5: 57.0, 6: nan}}
df = pd.DataFrame.from_dict(data)

# fit a kernel density estimate to the observed (non-missing) weights
kde = KernelDensity(kernel=KERNEL, bandwidth=BANDWIDTH).fit(df[["weight"]].dropna().values)

# replace nan with values sampled from the fitted kde
n_missing = df.weight.isna().sum()
df.loc[df.weight.isna(), "weight"] = kde.sample(n_missing).ravel()

output:

  company     weight
0       A  30.000000
1       A  45.000000
2       A  27.000000
3       A  56.542771
4       A  57.000000
5       A  57.000000
6       A  38.970918
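
Note that kde.sample draws fresh random values on every run, so your imputed numbers will differ from the ones above. If you need reproducible imputations, sample accepts a random_state (this reuses df, kde and n_missing from the snippet above):

# fixed seed -> the same imputed values on every run
df.loc[df.weight.isna(), "weight"] = kde.sample(n_missing, random_state=0).ravel()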

sample data and density plots:

import plotly.express as px

# histogram
px.histogram(df.weight, nbins=40).show()

# density line plot
x_vals = np.linspace(df.weight.min(), df.weight.max(), 1000)
density = np.exp(kde.score_samples(x_vals.reshape(-1,1)))
px.line(x=x_vals, y=density).show()

[histogram of the sample weights]

[estimated density curve over the weight range]
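
Since the question mentions scipy: scipy.stats.gaussian_kde does essentially the same thing with automatic bandwidth selection (Scott's rule by default). A rough equivalent sketch, assuming the same weight column:

import pandas as pd
from numpy import nan
from scipy.stats import gaussian_kde

df = pd.DataFrame({"company": ["A"] * 7,
                   "weight": [30.0, 45.0, 27.0, nan, 57.0, 57.0, nan]})

# fit a gaussian KDE to the observed weights; bandwidth is picked automatically
kde = gaussian_kde(df["weight"].dropna().values)

# resample returns an array of shape (1, n), so flatten it before assigning
n_missing = df["weight"].isna().sum()
df.loc[df["weight"].isna(), "weight"] = kde.resample(n_missing).ravel()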

anon01
  • Thank you very much for the answer! Seems like I will have to read up on scikit-learn, but I think this is what I'm looking for! – ChrisHo1341 May 25 '21 at 06:15