
I am fairly new to machine learning and Python, and I am building an MLP. I have two CSV files on which I want to train my model. Both files have the same dimensions: four feature columns (A, B, C, D), a fifth column (E) that holds the output (a binary class label), and the same number of rows. I want to train my model on both files and then test it on a separate test file. I have searched through different solutions to this problem. One way would be to concatenate both files and train the model batch-wise, as sketched below. But I was wondering: is there a way to train my model on the individual files and then test it on the test dataset?
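For example, I imagine the concatenation approach would look something like this with pandas (the column names A, B, C, D, E come from my files):

import pandas as pd

file1 = "/home/Documents/data_in1.csv"
file2 = "/home/Documents/data_in21.csv"

# stack the two files row-wise; both share the columns A, B, C, D, E
data = pd.concat([pd.read_csv(f) for f in [file1, file2]], ignore_index=True)
X = data[["A", "B", "C", "D"]].values  # four feature columns
y = data["E"].values                   # binary target

Here is my current attempt at training on the files individually: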

import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, precision_recall_curve,
                             roc_auc_score, roc_curve, accuracy_score)
from keras.callbacks import EarlyStopping, ModelCheckpoint
import seaborn as sns
from keras.models import Sequential
from keras.layers import Input, Dense, Flatten, Dropout, BatchNormalization
from keras.optimizers import Adam, SGD, RMSprop
import pickle
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

file1 = "/home/Documents/data_in1.csv"
file2 = "/home/Documents/data_in21.csv"
files = [file1, file2]

model = Sequential()
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

def BatchGenerator(files):
    for file in files:
        current_data = pickle.load(open(file, "rb"))
        X_train = current_data[:,:-1]
        y_train = current_data[:,-1]
        yield (X_train, y_train)

n_epochs = 100
# train model on each dataset
for epoch in range(n_epochs):
    for (X_train, y_train) in BatchGenerator(files):
        model.fit(X_train, y_train, batch_size = 32, nb_epoch = 1)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

When I run this code, I get an unpickling error:

UnpicklingError                           Traceback (most recent call last)
<ipython-input-34-c753f0ea5795> in <module>
 31 # train model on each dataset
 32 for epoch in range(n_epochs):
> 33     for (X_train, y_train) in BatchGenerator(files):
 34         model.fit(X_train, y_train, batch_size = 32, nb_epoch = 1)

<ipython-input-34-c753f0ea5795> in BatchGenerator(files)
 23 def BatchGenerator(files):
 24     for file in files:
> 25         current_data = pickle.load(open(file, "rb"))
 26         X_train = current_data[:,:-1]
 27         y_train = current_data[:,-1]

UnpicklingError: could not find MARK

Is this the wrong way to train my model? Is there a better way to get around this issue when training a model with more than one training dataset? Any help would be much appreciated.

1 Answer


Possible duplicate questions: _pickle.UnpicklingError: could not find MARK and Training a Neural Network with Multiple Datasets (Keras)

According to the first link, this should fix your error:

for file in files:
    with open(file, "rb") as f:
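        # rewind the file handle to the beginning before unpickling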
        f.seek(0)
        current_data = pickle.load(f)
        ...

Edit 1: Since the above did not solve your problem, I suggest using another library, like pandas, to read the CSV files directly. The unpickling error occurs because pickle.load only works on files written with pickle.dump, which a plain CSV is not.

First, import the library:

import pandas as pd

And then rewrite BatchGenerator to read the CSV files directly:

def BatchGenerator(files):
    for file in files:
        # pd.read_csv parses the file as text, so no unpickling is involved
        df = pd.read_csv(file)
        X_train = df.drop(["E"], axis=1).values  # feature columns A, B, C, D
        y_train = df["E"].values                 # binary target column
        yield (X_train, y_train)
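For completeness, here is a minimal sketch of how the generator plugs into your training loop. Note that compile() must be called before fit() (your code compiles after the loop), and that recent Keras versions spell the argument epochs rather than nb_epoch:

# compile once, before any call to fit()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

n_epochs = 100
for epoch in range(n_epochs):
    # one pass over each file per outer epoch
    for X_train, y_train in BatchGenerator(files):
        model.fit(X_train, y_train, batch_size=32, epochs=1, verbose=0)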
Tedpac