
I am working on a linear regression task and I only know the concept of simple linear regression where we give an 'x' value and it predicts the 'y' value.

I have generated semi-random numbers between 100 and 100000 using a specific algorithm and saved the result in a CSV column.

Now I want to use this column to train a linear regressor so that it learns the sequence of these numbers and then predicts the next number from the last number I give it.

Alternatively, I could treat this as a sequence generation problem using an LSTM. Would an LSTM be a good approach here, where I feed it this 1-D dataset of numbers and it generates more numbers on that basis?

I have only one column, the x column, and no y column. I searched for "How to use linear regression on 1-D data" but found nothing.

Is there any way to train a linear regression on 1-D data to predict a number? I am using Python for this task.

My CSV file looks like this:

CSV of 1-D numbers

  • You need dependent and independent variables to do regression. In simple terms - you need at least 2 dimension data to do any type of regression. – lostin Apr 21 '20 at 19:55
  • "Regression" implicitly involves 2D data; it is an approximation of Y, if Y = f(X). X is the independent variable and Y is the dependent variable. – Susmit Agrawal Apr 21 '20 at 19:56
  • You can see in my CSV image. My next number is dependent on my previous number. Can't I use this logic and do regression, sir? – Asad Faraz Apr 21 '20 at 19:58
  • Maybe one of [1](https://stackoverflow.com/questions/16830946) [2](https://stackoverflow.com/questions/38674197) [3](https://stackoverflow.com/questions/55957474) [4](https://stackoverflow.com/questions/51524005) [5](https://stackoverflow.com/questions/46001464) [6](https://stackoverflow.com/questions/50794383) – ggorlen Apr 21 '20 at 20:05
  • @AsadFaraz First, this question should have been asked on [CrossValidated](https://stats.stackexchange.com/) StackExchange. A lot more statistics-related stuff in there :) Also, yes, you have a "dependent" variable or something to work with in there, but! you haven't extracted it yet, therefore you can't use it. To predict the numbers you'd need to have something to predict them from.. some factors that will affect the predicted (Y) value. Perhaps some input variables you used for that "specific algorithm". If not, you'll need to invent the X, the variable that can be used to predict your Y. – Peter Badida Apr 21 '20 at 20:06
  • Sir, The prediction of the next number will be on the basis of all the previous numbers and the last number. – Asad Faraz Apr 21 '20 at 20:06

1 Answer


I think you can borrow ideas from time series analysis, such as moving-average and autoregressive models, and build a dataset that fits a regression problem.

You can plot the autocorrelation to find how many lags you need to consider for the next prediction. Use the pandas Series.autocorr method to compute the autocorrelation at each lag up to some maximum, then plot the correlogram.
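As a rough illustration of that check, here is a minimal sketch that computes the autocorrelation for lags 1 to 20 and reports the lag with the largest absolute value. The series is a synthetic stand-in generated with NumPy (an assumption, since the real CSV is not shown), and the names `series`, `max_lag`, `acf` and `best_lag` are made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV column (assumption: with real data you would
# load the column from the CSV file instead of generating it here).
rng = np.random.default_rng(0)
series = pd.Series(rng.integers(100, 100_000, size=500))

max_lag = 20  # hypothetical upper bound on the lags to inspect

# Series.autocorr(lag=k) is the Pearson correlation between the series
# and a copy of itself shifted by k positions.
acf = pd.Series({lag: series.autocorr(lag=lag) for lag in range(1, max_lag + 1)})

# The lag whose correlation is largest in absolute value is the strongest candidate.
best_lag = acf.abs().idxmax()

print(acf.round(3))
print("lag with the largest |autocorrelation|:", best_lag)
```

If every value is close to zero, the previous values carry little linear information about the next one, and a linear model built on lags is unlikely to predict well.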

Let's say the last 5 values are highly correlated with the latest value.

Then you can stack these numbers into one row like this, where the latest value in your case is t:

            |--------- X_train ---------|                               |-- y_train --|
1st row ->  226, 200, 1169, 134, 117     (t-1, t-2, t-3, t-4, t-5)      predicted value -> 239 (t)
2nd row ->  200, 1169, 134, 117, 759     (t-2, t-3, t-4, t-5, t-6)      predicted value -> 226 (t-1)
3rd row ->  1169, 134, 117, 759, 102     (t-3, t-4, t-5, t-6, t-7)      predicted value -> 200 (t-2)
...and so on.

The pandas shift method makes it easy to shift the series one lag at a time and build this dataset. Now you have an X_train and a y_train set. Split the dataset and train a linear model.
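The steps above can be sketched roughly as follows. The helper `make_lagged` is hypothetical (not a pandas function), and so are the column names `lag_1` ... `lag_5` and `target`. The first part rebuilds the three rows of the table from its eight values. The second part fits a linear model on a synthetic series with a known dependence on the previous value, since the real data is not available. The fit uses plain least squares via `numpy.linalg.lstsq` instead of a dedicated regression class, and the split is chronological (an assumption about what "split the dataset" should mean for ordered data).

```python
import numpy as np
import pandas as pd


def make_lagged(series: pd.Series, n_lags: int) -> pd.DataFrame:
    """Hypothetical helper: columns lag_1..lag_n are inputs, target is the value at t."""
    columns = {f"lag_{k}": series.shift(k) for k in range(1, n_lags + 1)}
    columns["target"] = series
    # The first n_lags rows have no complete history, so drop them.
    return pd.DataFrame(columns).dropna()


# The eight values from the table above, oldest first.
example = pd.Series([102, 759, 117, 134, 1169, 200, 226, 239])
example_frame = make_lagged(example, n_lags=5)
print(example_frame)

# Synthetic series with a known dependence on the previous value (assumption,
# used only so the fit has something to find): x[t] = 0.6 * x[t-1] + noise.
rng = np.random.default_rng(0)
noise = rng.normal(size=2000)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.6 * x[t - 1] + noise[t]

frame = make_lagged(pd.Series(x), n_lags=5)
X = frame.drop(columns="target").to_numpy()
y = frame["target"].to_numpy()

# Chronological split: train on the earlier 80 %, test on the rest, no shuffling.
split = int(len(frame) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Ordinary least squares with an intercept column appended to the inputs.
A_train = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

A_test = np.column_stack([X_test, np.ones(len(X_test))])
pred = A_test @ coef
print("lag coefficients:", coef[:-1].round(3))
print("test RMSE:", float(np.sqrt(np.mean((pred - y_test) ** 2))))
```

To predict one step ahead on real data, you would build a single row from the five most recent values, most recent first, and multiply it by the fitted coefficients in the same way.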

Rajith Thennakoon
  • Sir, my series autocorr() at lag =1 returns -0.020421683003898256 – Asad Faraz Apr 22 '20 at 07:25
  • @AsadFaraz just call me by name. =D Check the autocorrelation for different lags, say up to lag 10 or 20, and find the most correlated lag. In a correlogram the first bar is always 1, so look for the tallest bar or bars after it and take the corresponding lag (the one with the highest correlation with the first bar). – Rajith Thennakoon Apr 22 '20 at 07:36
  • Use this to plot the ACF: https://www.statsmodels.org/stable/generated/statsmodels.graphics.tsaplots.plot_acf.html – Rajith Thennakoon Apr 22 '20 at 07:37
  • my graph is not a satisfying one. But I can make my data up to t-5. I think I should try that! – Asad Faraz Apr 22 '20 at 08:11
  • You can try an LSTM as well. If you have enough data, an LSTM can identify the underlying relationship better than a linear model. – Rajith Thennakoon Apr 22 '20 at 08:33