I have an Excel file that stores a sequence in each column (reading from top cell to bottom cell), and the trend of the sequence is similar to the previous column. So I'd like to predict the sequence for the nth column in this dataset.
A sample of my data set:
Each column holds a sequence of values, and the sequences progress gradually as we move to the right, so I want to predict, for example, the values in column Z.
Here's my code so far:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Read the Excel file in rows
df = pd.read_excel(open('vec_sol2.xlsx', 'rb'),
                   header=None, sheet_name='Sheet1')
print(type(df))
length = len(df.columns)
# Get the sequence for each row
x_train, x_test, y_train, y_test = train_test_split(
    np.reshape(range(0, length - 1), (-1, 1)), df, test_size=0.25, random_state=0)
print("y_train shape: ", y_train.shape)
pred_model = LogisticRegression()
pred_model.fit(x_train, y_train)
print(pred_model)
I'll explain the logic as much as possible:
- x_train and x_test will just be the index / column number that is associated with a sequence.
- y_train is an array of sequences.
- There is a total of 51 columns, so splitting it with 25% being test data results in 37 train sequences and 13 test sequences.
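The split sizes can be checked by hand. This is a minimal sketch, assuming that train_test_split rounds a fractional test count up when test_size is a float; the variable names (n_columns, n_samples, n_test, n_train) are only for illustration and do not appear in my program. It also shows that 37 + 13 adds up to 50 rather than 51, because range(0, length - 1) stops at 49.

```python
import math

n_columns = 51                            # len(df.columns), as reported above
indices = list(range(0, n_columns - 1))   # same range as in the code: 0..49
n_samples = len(indices)                  # one fewer than the column count

test_size = 0.25
n_test = math.ceil(test_size * n_samples)  # 12.5 is rounded up (assumed rule)
n_train = n_samples - n_test               # the remainder goes to training

print(n_samples, n_train, n_test)  # prints: 50 37 13
```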
I've managed to get the shape of each variable while debugging. They are:
- x_train: (37, 1)
- x_test: (13, 1)
- y_train: (37, 51)
- y_test: (13, 51)
But right now, running the program gives me this error:
ValueError: bad input shape (37, 51)
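The error does not seem to depend on the contents of the Excel file. Below is a minimal sketch that reproduces it, assuming random integers in place of the spreadsheet and a 50 x 51 table (the shape implied by the numbers above); data, x, rng and error_message are stand-in names. The exact wording of the message may differ between scikit-learn versions, so the sketch only captures it rather than matching the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the spreadsheet: 50 rows x 51 columns (assumed shape)
data = rng.integers(0, 5, size=(50, 51))
# One index per row, shaped as a single-feature column like in the question
x = np.arange(50).reshape(-1, 1)

x_train, x_test, y_train, y_test = train_test_split(
    x, data, test_size=0.25, random_state=0)

try:
    # y_train is 2-D here, which is what fit() complains about
    LogisticRegression().fit(x_train, y_train)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(y_train.shape)
print(error_message)
```

With a two-dimensional y_train the call to fit() fails in the same way as in my program, so the shape of the target looks like the trigger rather than the data itself.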
What is my mistake here?