I am fairly new to machine learning and Python, and I am building an MLP. I have two CSV files that I want to train my model on. Both files have the same layout and the same number of rows: four feature columns (A, B, C, D) and a binary class label in the fifth column (E). I want to train the model on both files and then evaluate it on a separate test file. One solution I have come across is to concatenate the two files and then train the model batch-wise. However, I am wondering whether there is a way to train the model on the individual files, one after the other, and then test it on the test dataset. This is what I have so far:
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, precision_recall_curve, roc_auc_score,
                             roc_curve, accuracy_score)
from keras.callbacks import EarlyStopping, ModelCheckpoint
import seaborn as sns
from keras.models import Sequential, K
from keras.layers import Input, Dense, Flatten, Dropout, BatchNormalization
from keras.optimizers import Adam, SGD, RMSprop
import pickle
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
file1 = "/home/Documents/data_in1.csv"
file2 = "/home/Documents/data_in21.csv"
files = [file1,file2]
model = Sequential()
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
def BatchGenerator(files):
    for file in files:
        current_data = pickle.load(open(file, "rb"))
        X_train = current_data[:,:-1]
        y_train = current_data[:,-1]
        yield (X_train, y_train)
n_epochs = 100
# train model on each dataset
for epoch in range(n_epochs):
    for (X_train, y_train) in BatchGenerator(files):
        model.fit(X_train, y_train, batch_size = 32, nb_epoch = 1)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
When I run this code, I get an unpickling error:
UnpicklingError Traceback (most recent call last)
<ipython-input-34-c753f0ea5795> in <module>
31 # train model on each dataset
32 for epoch in range(n_epochs):
> 33 for (X_train, y_train) in BatchGenerator(files):
34 model.fit(X_train, y_train, batch_size = 32, nb_epoch = 1)
<ipython-input-34-c753f0ea5795> in BatchGenerator(files)
23 def BatchGenerator(files):
24 for file in files:
> 25 current_data = pickle.load(open(file, "rb"))
26 X_train = current_data[:,:-1]
27 y_train = current_data[:,-1]
UnpicklingError: could not find MARK
Is this the wrong way to train my model? Is there a better way to handle training with more than one training dataset? Any help would be much appreciated.
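A likely cause of the error above is that `pickle.load` expects a file written by `pickle.dump`, whereas the two inputs are plain CSV text. A minimal sketch of a CSV-based loader follows, assuming each file has one header row, contains only numeric values, and stores the label in the last column. The names `load_csv`, `csv_batch_generator`, `load_concatenated`, and `skip_header` are hypothetical and not part of the original code.

```python
import numpy as np


def load_csv(path, skip_header=1):
    """Read one CSV file into features X (all but the last column) and labels y."""
    # Assumption: comma-separated numeric data with `skip_header` header rows.
    # ndmin=2 keeps the result two-dimensional even for a single data row.
    data = np.loadtxt(path, delimiter=",", skiprows=skip_header, ndmin=2)
    return data[:, :-1], data[:, -1]


def csv_batch_generator(files, skip_header=1):
    """Yield one (X, y) pair per file, mirroring BatchGenerator above."""
    for path in files:
        yield load_csv(path, skip_header)


def load_concatenated(files, skip_header=1):
    """Stack the rows of all files into a single (X, y) pair."""
    parts = [load_csv(path, skip_header) for path in files]
    X = np.vstack([X_part for X_part, _ in parts])
    y = np.concatenate([y_part for _, y_part in parts])
    return X, y
```

With either loader, `model.compile(...)` would need to run before the first `model.fit(...)` call, and in Keras 2 and later the `nb_epoch` argument is named `epochs`. Calling `fit` on one file after the other keeps updating the same weights, so the model does learn from both files, but each call sees only one file's rows. The concatenated variant lets `fit` shuffle rows from both files together.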