I have an Excel file that stores a sequence in each column (reading from the top cell to the bottom cell), and the trend of each sequence is similar to that of the previous column. I'd like to predict the sequence for the nth column of this dataset.

A sample of my data set:

[image: sample data]

Each column holds a sequence of values, and the sequences progress gradually as we move to the right, so I want to predict, for example, the values in column Z.

Here's my code so far:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Read the Excel file in rows
df = pd.read_excel(open('vec_sol2.xlsx', 'rb'),
                header=None, sheet_name='Sheet1')
print(type(df))
length = len(df.columns)
# Get the sequence for each row

x_train, x_test, y_train, y_test = train_test_split(
    np.reshape(range(0, length - 1), (-1, 1)), df, test_size=0.25, random_state=0)

print("y_train shape: ", y_train.shape)

pred_model = LogisticRegression()
pred_model.fit(x_train, y_train)
print(pred_model)

I'll explain the logic as much as possible:

  • x_train and x_test will just be the index / column number that is associated with a sequence.
  • y_train is an array of sequences.
  • There are 51 columns in total, so a split with 25% test data results in 37 training sequences and 13 test sequences.

I checked the shape of each variable while debugging:

  • x_train : (37, 1)
  • x_test : (13, 1)
  • y_train : (37, 51)
  • y_test : (13, 51)
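For reference, these shapes can be reproduced without the spreadsheet. This is a minimal sketch, assuming a random 50-row, 51-column frame in place of vec_sol2.xlsx (the row count is inferred from 37 + 13; the random values are purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spreadsheet (assumption): 50 rows, 51 columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 51)))

length = len(df.columns)                        # 51 columns
x = np.reshape(range(0, length - 1), (-1, 1))   # indices 0..49, shape (50, 1)

# Same call as in the question: the frame itself is passed as y.
x_train, x_test, y_train, y_test = train_test_split(
    x, df, test_size=0.25, random_state=0)

print(x_train.shape, x_test.shape)   # (37, 1) (13, 1)
print(y_train.shape, y_test.shape)   # (37, 51) (13, 51)
```

Note that train_test_split divides along rows, so each of the 37 training samples here is a spreadsheet row of 51 values rather than a column.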

But right now, running the program gives me this error:

ValueError: bad input shape (37, 51)

What is my mistake here?

Fawwaz Yusran

1 Answer

I don't understand why you are using this:

x_train, x_test, y_train, y_test = train_test_split(
    np.reshape(range(0, length - 1), (-1, 1)), df, test_size=0.25, random_state=0)

Your data is already in df. Extract X and y from it, and then split them into train and test sets.

Try this:

X = df.iloc[:,:-1]
y = df.iloc[:, -1:]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

Otherwise, the shapes you shared show that you are trying to predict a 51-column output from a single feature, which is odd if you think about it.
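To make the resulting shapes concrete, here is a minimal sketch of the split suggested above. It assumes a random 50-row, 51-column frame in place of vec_sol2.xlsx; the synthetic data is illustrative only and is not taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spreadsheet (assumption): 50 rows, 51 columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 51)))

X = df.iloc[:, :-1]   # every column except the last: 50 features per row
y = df.iloc[:, -1:]   # last column, kept two-dimensional: shape (50, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

print(X_train.shape, X_test.shape)   # (40, 50) (10, 50)
print(y_train.shape, y_test.shape)   # (40, 1) (10, 1)
```

With this layout each row is one sample, the first 50 columns are its features, and the last column is the single target.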

Shweta Chandel
  • Thanks. But what does X refer to now? Also, a second question: is it possible for a **set** of values to be predicted from the values of previous columns, as I described at the beginning of this thread? – Fawwaz Yusran Nov 05 '18 at 14:43
  • I now get this error when using your solution: `A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().` – Fawwaz Yusran Nov 05 '18 at 14:45
  • X refers to the input vectors and y to the output vector. If by "set" you mean an entirely new column with the same number of rows as the input given to the model, then yes. For the error, check this link: https://stackoverflow.com/questions/34165731/a-column-vector-y-was-passed-when-a-1d-array-was-expected – Shweta Chandel Nov 06 '18 at 10:51
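
The two points in the comments above can be sketched together. This is a rough illustration under stated assumptions, not a definitive solution: the data is synthetic (50 rows, 51 columns, each column drifting slightly from the previous one), and LinearRegression is swapped in for LogisticRegression on the assumption that the values are continuous, since LogisticRegression is a classifier.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in (assumption): 50 rows, 51 columns, each column
# shifted by 0.1 relative to the one before it, plus a little noise.
rng = np.random.default_rng(0)
base = np.linspace(0.0, 1.0, 50).reshape(-1, 1)   # shape (50, 1)
drift = 0.1 * np.arange(51).reshape(1, -1)        # shape (1, 51)
df = pd.DataFrame(base + drift + 0.01 * rng.standard_normal((50, 51)))

# (1) Single target column: flatten y to 1-D to avoid the
#     "column-vector y was passed" message.
X = df.iloc[:, :-1].values
y = df.iloc[:, -1:].values.ravel()   # shape (50,) instead of (50, 1)
single = LinearRegression().fit(X, y)

# (2) Whole column at once: treat each spreadsheet column as one sample,
#     with the column index as its only feature.
col_index = np.arange(df.shape[1]).reshape(-1, 1)   # shape (51, 1)
col_values = df.T.values                            # shape (51, 50)
multi = LinearRegression().fit(col_index, col_values)

# Predict the column that would follow the last one (index 51).
next_col = multi.predict([[df.shape[1]]])
print(y.shape, next_col.shape)   # (50,) (1, 50)
```

Part (2) relies on LinearRegression accepting a two-dimensional y of shape (n_samples, n_targets), so one fit yields a full column of predictions. Whether a straight-line trend per row suits the real spreadsheet is an open question.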