I am building a Python application that predicts PM2.5 pollution values from a dataframe. I am using the values for November, and as a first step I am trying to build a linear regression model. How can I fit the linear regression without using the dates? I only need predictions for PM2.5; the dates are known. Here is what I have tried so far:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = pd.read_csv("https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv")
data['day'] = pd.to_datetime(data['day'], dayfirst=True)

#Splitting the dataset into training(70%) and test(30%)
X_train, X_test, y_train, y_test = train_test_split(data['day'], data['pm25'], test_size=0.3,
                                                    random_state=0
                                                    )

#Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(data['day'], data['pm25'])

This code throws the following error:

ValueError: Expected 2D array, got 1D array instead:
array=['2019-11-01T00:00:00.000000000' '2019-11-01T00:00:00.000000000'
 '2019-11-01T00:00:00.000000000' ... '2019-11-30T00:00:00.000000000'
 '2019-11-30T00:00:00.000000000' '2019-11-30T00:00:00.000000000'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
petezurich
moro_92
  • Try reshaping your data just as the error mentioned. Does this answer your question? [Error in Python script "Expected 2D array, got 1D array instead:"?](https://stackoverflow.com/questions/45554008/error-in-python-script-expected-2d-array-got-1d-array-instead) – MattR Feb 27 '20 at 13:07
  • `lin_reg.fit(data[['day']], data['pm25'])`, notice the double brackets. – Quang Hoang Feb 27 '20 at 13:08
  • And why don't you use `X_train` and `y_train` for fitting your model? – petezurich Feb 27 '20 at 13:08

2 Answers

You need to pass a pandas DataFrame rather than a pandas Series for the X values, so you might want to do something like this:

UPDATE:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import datetime

data = pd.read_csv("https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv")
data['day'] = pd.to_datetime(data['day'], dayfirst=True)

print(data.head())

# Double brackets keep X as a 2D DataFrame; copy it so the conversion below
# does not modify a view of the original data
x_data = data[['day']].copy()
# Linear regression does not work on date-type data, so convert it to a numerical type
x_data['day'] = x_data['day'].map(datetime.datetime.toordinal)
y_data = data['pm25']

# Split the dataset into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3,
                                                    random_state=0
                                                    )

#Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(X_train[["day"]], y_train)

Now you can predict values for the test set using:

print(lin_reg.predict(X_test[["day"]])) #-->predict the data
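To try the same steps without downloading the CSV, here is a minimal sketch on synthetic data. The `day` and `pm25` column names follow the question, but the values, the upward trend, and the noise level are invented for illustration only, so the resulting score says nothing about the real dataset.

```python
import datetime

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real CSV (values are made up for illustration)
rng = np.random.default_rng(0)
days = pd.date_range("2019-11-01", "2019-11-30", freq="D").repeat(10)
data = pd.DataFrame({
    "day": days,
    "pm25": 20 + 0.5 * days.day.to_numpy() + rng.normal(0, 2, len(days)),
})

# Convert dates to ordinal integers so the regression gets a numeric feature
X = data[["day"]].copy()
X["day"] = X["day"].map(datetime.datetime.toordinal)
y = data["pm25"]

# 70% training, 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training portion only, then predict on the held-out portion
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

predictions = lin_reg.predict(X_test)
r2 = lin_reg.score(X_test, y_test)  # R^2 on the held-out 30%
print(predictions[:5], r2)
```

Fitting on `X_train` and scoring on `X_test` keeps the evaluation separate from the data the model has seen, which is the point of making the split in the first place.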
Shubham Sharma
  • If I use that, it throws this error: KeyError: "None of [Index(['day'], dtype='object')] are in the [index]" – moro_92 Feb 27 '20 at 13:35
  • I think you want to do the same in the train/test split: replace `train_test_split(data['day'], data['pm25'],...)` with `train_test_split(data[['day']], data['pm25'],...)` – Itamar Mushkin Feb 27 '20 at 13:36
  • Using lin_reg.fit(X_train[['day']], y_train['pm25']) throws the error KeyError: 'pm25', so it is not working – moro_92 Feb 27 '20 at 13:41
  • @iulianaiuliana You have to pass a dataframe for your `x_data` in `train_test_split`. Since `y_train` is already a pandas Series representing the column `pm25`, you don't have to call `y_train["pm25"]`. – Shubham Sharma Feb 27 '20 at 13:48
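The two `KeyError`s reported in the comments above can be reproduced with a small sketch. The frame below is made up and only reuses the question's column names; `raises_key_error` is a hypothetical helper written for this example, not part of pandas or scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small made-up frame using the question's column names
data = pd.DataFrame({"day": range(10), "pm25": [float(v) for v in range(10)]})

# Splitting a Series (single brackets) yields Series, which have no columns
X_train_1d, X_test_1d, y_train, y_test = train_test_split(
    data["day"], data["pm25"], test_size=0.3, random_state=0
)


def raises_key_error(func):
    """Return True if calling func() raises KeyError (hypothetical helper)."""
    try:
        func()
    except KeyError:
        return True
    return False


# Both lookups treat a column name as an index label on a Series
series_column_lookup_fails = raises_key_error(lambda: X_train_1d[["day"]])
target_lookup_fails = raises_key_error(lambda: y_train["pm25"])

# Splitting a DataFrame (double brackets) keeps the 'day' column available
X_train_2d, X_test_2d, _, _ = train_test_split(
    data[["day"]], data["pm25"], test_size=0.3, random_state=0
)
print(series_column_lookup_fails, target_lookup_fails, list(X_train_2d.columns))
```

In both failing cases the object is a Series whose index holds row labels, so a column name such as `'day'` or `'pm25'` is not found there.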

This is just an addition explaining why you need the double brackets "[[", and how to avoid the frustration.

The reason data[['day']] works and data['day'] doesn't is that the fit method expects X to be two-dimensional, with shape (n_samples, n_features), while y may be one-dimensional. See the documentation:

fit(self, X, y, sample_weight=None)
Fit linear model.

Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    Training data.
y : array-like of shape (n_samples,) or (n_samples, n_targets)
    Target values. Will be cast to X's dtype if necessary.

So for example:

data[['day']].shape
(43040, 1)
data['day'].shape
(43040,)
np.resize(data['day'],(len(data['day']),1)).shape
(43040, 1)

These work because they have the structure required:

lin_reg.fit(data[['day']], data['pm25'])
lin_reg.fit(np.resize(data['day'],(len(data['day']),1)), data['pm25'])

While this doesn't:

lin_reg.fit(data['day'], data['pm25'])

Hence before running the function, check that you are providing input in the required format :)
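As a way to run that check, the sketch below compares the shapes and shows a few common ways to get a 2D input. It uses made-up numeric data with the question's column names, so `day` here is already an integer rather than a date; `to_frame()` and `reshape(-1, 1)` are alternatives to the `np.resize` call shown above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up numeric data; 'day' is already an integer here, not a date
data = pd.DataFrame({"day": np.arange(1, 31), "pm25": np.linspace(10.0, 40.0, 30)})

shape_1d = data["day"].shape                    # Series: one dimension
shape_frame = data[["day"]].shape               # DataFrame: two dimensions
shape_to_frame = data["day"].to_frame().shape   # Series converted to a DataFrame
shape_reshaped = data["day"].to_numpy().reshape(-1, 1).shape  # ndarray column

lin_reg = LinearRegression()

# A 1D X is rejected by scikit-learn's input validation
try:
    lin_reg.fit(data["day"], data["pm25"])
    fit_1d_failed = False
except ValueError:
    fit_1d_failed = True

# A 2D X with a single feature column is accepted
lin_reg.fit(data["day"].to_numpy().reshape(-1, 1), data["pm25"])
print(shape_1d, shape_frame, shape_to_frame, shape_reshaped, fit_1d_failed)
```

`reshape(-1, 1)` is the form the error message itself suggests for data with a single feature.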

StupidWolf
  • Thank you so much for the explanations! Can you please check whether the polynomial regression is applied properly? I want to use 70% of the data for training and 30% for testing, but I am not sure I managed this correctly: https://pastebin.com/QzUUkVxh – moro_92 Feb 27 '20 at 14:44