How to handle missing data in machine learning?

Question

I have a dataframe that is always missing data between 9pm on Fridays and 0am on Mondays. I'm using this data to make predictions through a linear regression algorithm, so this jump gums up my predictions:

                    date   timestamp   liters  next_liters
...
3442 2017-02-03 19:00:00  1486148400  0.86261      0.86354
3443 2017-02-03 20:00:00  1486152000  0.86354      0.86356
3444 2017-02-03 21:00:00  1486155600  0.86356      1.86330
3445 2017-02-06 00:00:00  1486339200  1.86330      1.86305
3446 2017-02-06 01:00:00  1486342800  1.86305      1.86321
3447 2017-02-06 02:00:00  1486346400  1.86321      1.86352
3448 2017-02-06 03:00:00  1486350000  1.86352      1.86311
3449 2017-02-06 04:00:00  1486353600  1.86311      1.86271
...

I wonder how I could handle this so that the Friday-to-Monday jump isn't taken into account when the algorithm processes the data.

I thought of converting those values to NaN, but, as you know, sklearn doesn't allow this kind of information.

This is my current code:

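import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = df[['date', 'epoch', 'liters']]
# Target: the reading that follows each row
df['next_liters'] = df['liters'].shift(-1)

# The last row has no following reading, so drop it
df.dropna(inplace=True)

X = np.array(df.drop(['next_liters'], axis=1))
X = preprocessing.scale(X)

y = np.array(df['next_liters'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

clf = LinearRegression(fit_intercept=True, n_jobs=-1)
clf.fit(X_train, y_train)

print("LinearRegression (" + str(clf.score(X_test, y_test)) + ")")
print(clf.predict(X_test))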

See what the results look like if you project a line between the last and first values. It's a technique that works for many machine learning systems. If it doesn't come out close (the best you can expect), try adjusting the scale of the line or even try curves. — Mike, Feb 06 '17 at 21:44
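One reading of the suggestion above is plain linear interpolation across the weekend gap. The sketch below is only an illustration under that assumption: the four rows are copied from the sample in the question, while the names `grid` and `hourly` are made up for the example. Whether filling the weekend this way actually helps the regression would need to be checked on held-out data.

```python
import pandas as pd

# Four rows taken from the question's sample: readings stop on
# Friday 21:00 and resume on Monday 00:00.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2017-02-03 20:00",
                "2017-02-03 21:00",
                "2017-02-06 00:00",
                "2017-02-06 01:00",
            ]
        ),
        "liters": [0.86354, 0.86356, 1.86330, 1.86305],
    }
)

# Build a regular hourly grid; the missing weekend hours become NaN rows.
grid = pd.date_range(df["date"].min(), df["date"].max(), freq=pd.Timedelta(hours=1))
hourly = df.set_index("date").reindex(grid)

# Draw a straight line (in time) between the last Friday value
# and the first Monday value.
hourly["liters"] = hourly["liters"].interpolate(method="time")

print(hourly.head(8))
```

After this step the frame has no NaN values, so it can be passed to sklearn, but the weekend rows are synthetic rather than measured.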

score 0 · Answer 1 · edited May 23 '17 at 12:17

We can get the weekday with pandas built-in functions, make a new column from it, filter the df to exclude Saturday and Sunday, and then filter again to throw out any row later than 20:59:59 on a Friday.

This of course has nothing to do with ML, but is just some indexing with pandas.

df['weekday'] = df['date'].dt.dayofweek
# Monday=0 ... Sunday=6, so weekday < 5 keeps Monday to Friday
df = df[(df['weekday'] < 5)]

Now we need to filter out anything from 21:00:00 onwards on a Friday (weekday = 4). We can do this by grabbing the hour out of the date column (not elegant, since I need to make a new column again; I'm sure there is a cleaner way though!)

def hr_func(ts):
    return ts.hour

df['hour'] = df['date'].apply(hr_func)

# Keep rows that are not on a Friday, or are on a Friday before 21:00
df = df[(df['weekday'] != 4) | (df['hour'] < 21)]
df.head()

In other words, if it's not Friday, keep it, and if it is Friday but before 9pm, keep it.
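For reference, here is a self-contained sketch of the same filter written as a single boolean mask, without the helper columns. The hourly sample frame and the names `keep` and `filtered` are invented for the illustration; only the `date` column name comes from the question.

```python
import pandas as pd

# Made-up hourly sample from Thursday 2017-02-02 to Monday 2017-02-06.
df = pd.DataFrame(
    {
        "date": pd.date_range(
            "2017-02-02 00:00", "2017-02-06 23:00", freq=pd.Timedelta(hours=1)
        )
    }
)

weekday = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
hour = df["date"].dt.hour

# Keep weekdays only, and on Fridays keep only rows before 21:00.
keep = (weekday < 5) & ((weekday != 4) | (hour < 21))
filtered = df[keep]

print(len(df), len(filtered))
```

Using the `.dt` accessor for the hour avoids the row-by-row `apply` call, which tends to matter on larger frames.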

I am pretty sure this should work!

http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing

"The day of the week with Monday=0, Sunday=6" from: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.weekday.html

and

Get weekday/day-of-week for Datetime column of DataFrame

Thank you very much! Now I know how to select it, but how could I handle the data so that the algorithm doesn't take this jump into account? — mllamazares, Feb 06 '17 at 22:57
Ahh I see. I think I misunderstood the original question. I had read it as wanting a way to filter out Saturday, Sunday, and after 9pm on Friday, but you actually want a way for that missing time to be dealt with in the linear regression. Unfortunately, I have no answer. — Dylan, Feb 07 '17 at 13:35
