Hi there!

I'm taking the IBM Data Science course on Coursera and I'm writing some snippets for practice. I've created the following code:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Import and format the dataframes
ibov = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/datasets/master/ibov.csv')
ifix = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/datasets/master/ifix.csv')
ibov['DATA'] = pd.to_datetime(ibov['DATA'], format='%d/%m/%Y')
ifix['DATA'] = pd.to_datetime(ifix['DATA'], format='%d/%m/%Y')
ifix = ifix.sort_values(by='DATA', ascending=False)
ibov = ibov.sort_values(by='DATA', ascending=False)
ibov = ibov[['DATA','FECHAMENTO']]
ibov.rename(columns={'FECHAMENTO':'IBOV'}, inplace=True)
ifix = ifix[['DATA','FECHAMENTO']]
ifix.rename(columns={'FECHAMENTO':'IFIX'}, inplace=True)

# Merge datasets 
df_idx = ibov.merge( ifix, how='left', on='DATA')
df_idx.set_index('DATA', inplace=True)
df_idx.head()

# Split training and testing samples
x_train, x_test, y_train, y_test = train_test_split(df_idx['IBOV'], df_idx['IFIX'], test_size=0.2)

# Convert the samples to Numpy arrays
regr = linear_model.LinearRegression()
x_train = np.array([x_train])
y_train = np.array([y_train])
x_test = np.array([x_test])
y_test = np.array([y_test])

# Plot the result
regr.fit(x_train, y_train)
y_pred = regr.predict(y_train)
plt.scatter(x_train, y_train)
plt.plot(x_test, y_pred, color='blue', linewidth=3) # This line produces no result

I had some issues with the output values returned by the train_test_split() method, so I converted them to NumPy arrays, and then my code ran. The scatter plot renders normally, but the prediction line does not appear.

Running this code on my IBM Data Cloud Notebook produces the following warning:

/opt/conda/envs/Python36/lib/python3.6/site-packages/matplotlib/axes/_base.py:380: MatplotlibDeprecationWarning: cycling among columns of inputs with non-matching shapes is deprecated. cbook.warn_deprecated("2.2", "cycling among columns of inputs "

I searched on Google and here on StackOverflow, but I can't figure out what is wrong.

I'd appreciate some assistance. Thanks in advance!

bodruk

1 Answer

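There are several issues in your code. First, y_pred = regr.predict(y_train) predicts from the targets instead of the features. Second, wrapping each sample in np.array([...]) produces arrays of shape (1, n), which scikit-learn reads as a single sample with n features. As a result, plt.plot(x_test, y_pred) receives arrays with non-matching shapes, which triggers the warning, and no line is drawn.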

The following code snippet should set you in the right direction:

# Split training and testing samples
x_train, x_test, y_train, y_test = train_test_split(df_idx['IBOV'], df_idx['IFIX'], test_size=0.2)

# Convert the samples to NumPy arrays
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values

# Fit the model; scikit-learn expects a 2D feature array of shape (n_samples, n_features)
regr = linear_model.LinearRegression()
regr.fit(x_train.reshape(-1,1), y_train)

# Plot the training points and the fitted line (x is sorted so the line is drawn left to right)
plt.scatter(x_train, y_train)
idx = np.argsort(x_train)
y_pred = regr.predict(x_train[idx].reshape(-1,1))
plt.plot(x_train[idx], y_pred, color='blue', linewidth=3);

To do the same for the test subset with the already fitted model:

# Plot the result
plt.scatter(x_test, y_test)
idx = np.argsort(x_test)
y_pred = regr.predict(x_test[idx].reshape(-1,1))
plt.plot(x_test[idx], y_pred, color='blue', linewidth=3);
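
If the CSV files are not at hand, the same workflow can be tried on synthetic data. The following is only a minimal sketch: the data, the slope and intercept, and the names x_line and y_line are made up for illustration and are not taken from the IBOV/IFIX datasets. It also shows an alternative to sorting: because the fitted model is a straight line, predicting at the smallest and largest x is enough to draw it.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the IBOV (x) and IFIX (y) columns
rng = np.random.default_rng(0)
x = rng.uniform(60_000, 120_000, size=200)
y = 0.02 * x + 500 + rng.normal(0, 50, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Fit on a 2D feature array of shape (n_samples, 1)
regr = LinearRegression()
regr.fit(x_train.reshape(-1, 1), y_train)

# A straight line is determined by two points, so predict only at the extremes
x_line = np.array([x_test.min(), x_test.max()])
y_line = regr.predict(x_line.reshape(-1, 1))

# Scatter the held-out points and overlay the fitted line
fig, ax = plt.subplots()
ax.scatter(x_test, y_test)
ax.plot(x_line, y_line, color='blue', linewidth=3)
```

In a notebook the figure is displayed automatically; in a script, a call to plt.show() would be needed at the end.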

Feel free to ask questions if you have any.

Sergey Bushmanov
  • Thanks a lot for the assistance! Why do you use `x_train.reshape(-1,1)`? Do I need to convert the array size? – bodruk Mar 03 '20 at 21:34
  • Because your regressor expects a 2D array of input features. I suspect in your version you do not need to reshape, as your x_train is already 2D. – Sergey Bushmanov Mar 03 '20 at 21:35
  • Just found this explanation https://stackoverflow.com/a/42510505/2684718. Thanks for your answer! – bodruk Mar 03 '20 at 21:41
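
The shape difference discussed in these comments can be checked directly. This is a small sketch with a made-up four-element Series standing in for one of the columns returned by train_test_split; the names wrapped, flat and column are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for x_train as returned by train_test_split
x_train = pd.Series([10.0, 20.0, 30.0, 40.0])

wrapped = np.array([x_train])   # extra brackets add a leading axis
flat = x_train.values           # plain 1D array
column = flat.reshape(-1, 1)    # one row per sample, one feature column

print(wrapped.shape)  # (1, 4): read as 1 sample with 4 features
print(flat.shape)     # (4,)
print(column.shape)   # (4, 1): read as 4 samples with 1 feature
```

The -1 in reshape(-1, 1) tells NumPy to infer the number of rows from the array length, so the same call works for samples of any size.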