I have air pollution time series data for which I need to make a forward-period estimate. To do so, I used RandomForestRegressor from scikit-learn to make predictions, and I want to visualize the prediction output, but I am having trouble getting the x-axis to show the correct time index. Can anyone suggest how I can get a better visualization for my regression approach below? Is there a better way to do this?

my attempt

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url, parse_dates=['date'])
df.date = pd.DatetimeIndex(df.date)
# df.sort_values(by='date').reset_index(drop=True)
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
resultsDict={}
predictionsDict={}

split_date ='2017-12-01'
df_training = df.loc[df.date <= split_date]
df_test = df.loc[df.date > split_date]

## exclude pollution_index columns from training and testing data
df_tr = df_training.drop(['pollution_index'],axis=1)
df_te = df_test.drop(['pollution_index'],axis=1)

## scaling features
scaler = StandardScaler() 
scaler.fit(df_tr)
X_train = scaler.transform(df_tr)  
y_train = df_training['pollution_index']
X_test = scaler.transform(df_te)
y_test = df_test['pollution_index']

X_train_df = pd.DataFrame(X_train,columns=df_tr.columns)
X_test_df = pd.DataFrame(X_test,columns=df_te.columns)

reg = RandomForestRegressor(max_depth=2, random_state=0)
reg.fit(X_train, y_train)
yhat = reg.predict(X_test)
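# evaluate() is a scoring helper that is not defined in this snippet;
# score against the same target the model was trained on
resultsDict['Randomforest'] = evaluate(y_test, yhat)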
predictionsDict['Randomforest'] = yhat

## print out prediction from RandomForest
print(predictionsDict['Randomforest'])
plt.plot(df_test['pollution_index'].values , label='Original')
plt.plot(yhat,color='red',label='predicted')
plt.legend()

output of current attempt

Here is the output of the above attempt:

enter image description here

In this attempt, I fit a RandomForestRegressor and made a simple plot, but the plot doesn't show time on the x-axis. Why? Does anyone know how to fix this? Thanks.

desired plot

Ideally, after training the model, I want to make a forward-period estimate, and this is the kind of plot I want to produce from my attempt above:

enter image description here

Can anyone suggest a way to correctly visualize the regression output?

kim
  • Putting dates on axes is a recurring issue in matplotlib, and as a result there are a lot of answers to this particular question on SO. – NotAName Dec 08 '20 at 05:38
  • For example, this: https://stackoverflow.com/questions/49418248/plot-x-axis-as-date-in-matplotlib – NotAName Dec 08 '20 at 05:39
  • Please strip your code to the essential in order to provide a [*minimal* reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) – max Dec 08 '20 at 07:08

1 Answer

You will need to provide the dates explicitly to matplotlib.pyplot.plot().

plt.plot(df_test['date'],df_test['pollution_index'].values , label='Original')
plt.plot(df_test['date'],yhat,color='red',label='predicted')
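If the default date ticks come out crowded, the tick placement and label format can be controlled through matplotlib.dates. The following is a minimal sketch rather than a drop-in fix: the dates, actual and predicted arrays are synthetic stand-ins (assumptions) for df_test['date'], df_test['pollution_index'] and yhat, and the chosen label format is only one option.

```python
import numpy as np
import pandas as pd
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

# Synthetic stand-ins (assumed data, not the real pollution.csv):
# dates ~ df_test['date'], actual ~ df_test['pollution_index'], predicted ~ yhat
dates = pd.date_range("2017-12-02", periods=60, freq="D")
rng = np.random.default_rng(0)
actual = 50 + rng.normal(0, 5, size=len(dates)).cumsum()
predicted = actual + rng.normal(0, 3, size=len(dates))

fig, ax = plt.subplots(figsize=(10, 4))
# Passing the dates as the first argument puts real timestamps on the x-axis
ax.plot(dates, actual, label="Original")
ax.plot(dates, predicted, color="red", label="predicted")

# Let matplotlib pick sensible tick positions and label them as YYYY-MM-DD
ax.xaxis.set_major_locator(mdates.AutoDateLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
fig.autofmt_xdate()  # rotate the tick labels so they do not overlap

ax.set_xlabel("date")
ax.set_ylabel("pollution_index")
ax.legend()
# Call plt.show() or fig.savefig("forecast.png") to render the figure
```

With the real data, replace the three synthetic arrays with df_test['date'], df_test['pollution_index'] and yhat.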

You can also use the matplotlib-based plotting function from pandas:

df_test = df_test.assign(yhat=yhat)  # avoids SettingWithCopyWarning, since df_test is a slice of df
df_test.plot(x='date',y=['pollution_index','yhat'])

It automatically adds an x-axis label and a legend.
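Another option along the same lines is to collect the actual and predicted values in a frame indexed by date, so that pandas uses the index for the x-axis without needing x='date'. This is a rough sketch on synthetic data: the small df_test and yhat built here are stand-ins (assumptions), and results is a hypothetical name introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the real df_test and yhat (assumed data)
df_test = pd.DataFrame({
    "date": pd.date_range("2017-12-02", periods=30, freq="D"),
    "pollution_index": np.linspace(40.0, 60.0, 30),
})
yhat = df_test["pollution_index"].to_numpy() + 1.5

# Date-indexed frame holding the observed and predicted values side by side
results = pd.DataFrame(
    {"pollution_index": df_test["pollution_index"].to_numpy(), "yhat": yhat},
    index=pd.DatetimeIndex(df_test["date"], name="date"),
)

# With a DatetimeIndex, DataFrame.plot draws one line per column against time
ax = results.plot(figsize=(10, 4))
ax.set_ylabel("pollution_index")
```

Keeping the predictions in a date-indexed frame also makes it straightforward to slice by period, for example results.loc["2017-12-10":"2017-12-20"].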

max