I have a DataFrame that is always missing data between 9pm on Fridays and 0am on Mondays. I'm using this data to make predictions with a linear regression algorithm, so this jump messes up my predictions:
date timestamp liters next_liters
...
3442 2017-02-03 19:00:00 1486148400 0.86261 0.86354
3443 2017-02-03 20:00:00 1486152000 0.86354 0.86356
3444 2017-02-03 21:00:00 1486155600 0.86356 1.86330
3445 2017-02-06 00:00:00 1486339200 1.86330 1.86305
3446 2017-02-06 01:00:00 1486342800 1.86305 1.86321
3447 2017-02-06 02:00:00 1486346400 1.86321 1.86352
3448 2017-02-06 03:00:00 1486350000 1.86352 1.86311
3449 2017-02-06 04:00:00 1486353600 1.86311 1.86271
...
I wonder how I could handle this so that the Friday-to-Monday jump isn't taken into account when the algorithm processes the data.
I thought about converting those values to NaN but, as you know, sklearn doesn't accept NaN values.
This is my current code:
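import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = df[['date', 'epoch', 'liters']].copy()
# Target: the liters value of the following row
df['next_liters'] = df['liters'].shift(-1)
# Remove rows containing NaN, e.g. the last row, which has no next value
df.dropna(inplace=True)
X = np.array(df.drop(columns=['next_liters']))
X = preprocessing.scale(X)
y = np.array(df['next_liters'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = LinearRegression(fit_intercept=True, n_jobs=-1)
clf.fit(X_train, y_train)
print("LinearRegression (" + str(clf.score(X_test, y_test)) + ")")
print(clf.predict(X_test))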
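To make the goal concrete, here is a minimal sketch of one possible way to leave the weekend jump out, assuming valid rows are always exactly one hour (3600 seconds) apart. The column names `epoch` and `liters` come from the code above; the toy values and the `gap` helper column are only for illustration.

```python
import pandas as pd

# Toy frame mimicking the data above: hourly rows with a weekend gap.
df = pd.DataFrame({
    "epoch": [1486148400, 1486152000, 1486155600, 1486339200, 1486342800],
    "liters": [0.86261, 0.86354, 0.86356, 1.86330, 1.86305],
})

# Target: the liters value of the following row.
df["next_liters"] = df["liters"].shift(-1)

# Seconds until the following row (hypothetical helper column).
df["gap"] = df["epoch"].shift(-1) - df["epoch"]

# Keep only rows whose successor is exactly one hour later. This drops
# the Friday 21:00 row (weekend gap) and the last row (NaN gap).
clean = df[df["gap"] == 3600].drop(columns="gap")
print(clean)
```

The resulting `clean` frame contains no NaN values, so it could be passed through the same X/y steps as in the code above.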