
I'm currently trying to do some text classification using a neural network with my own input data.

Because my dataset is very limited (around 85 positive and 85 negative texts of ~1500 words per file), I was told to use cross-validation for my neural network to guard against overfitting.

I started building a neural network with the help of YouTube videos and guides, and my problem now is how to do the cross-validation.

My current code looks like this:

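# Import paths assume a Keras 2.x-style install; adjust them for your version
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from keras import utils
from keras.preprocessing import text
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

data = pd.read_excel('dataset3.xlsx')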
max_words = 1000
tokenize = text.Tokenizer(num_words=max_words, lower=True, char_level=False)

train_size = int(len(data) * .8)
train_posts = data['Content'][:train_size]
train_tags = data['Value'][:train_size]
test_posts = data['Content'][train_size:]
test_tags = data['Value'][train_size:]

tokenize.fit_on_texts(train_posts)
x_train = tokenize.texts_to_matrix(train_posts)
x_test = tokenize.texts_to_matrix(test_posts)

encoder = LabelEncoder()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)
num_classes = np.max(y_train) + 1
y_train = utils.to_categorical(y_train, num_classes)
y_test = utils.to_categorical(y_test, num_classes)

batch_size = 1
epochs = 20
model = Sequential()
model.add(Dense(750, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes))
model.add(Activation('sigmoid'))

model.summary()

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs= epochs,
                    verbose=1,
                    validation_split=0.2)

I played around with

KFold(n_splits=k, shuffle=True, random_state=1).split(x_train, y_train)

but I have no idea how to apply it to the neural network itself. I hope you can help me with this problem.

Thanks and regards,

Jason

2 Answers


You can use scikit-learn's KFold.split like this:

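import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

accuracy = []
kf = KFold(n_splits=k, shuffle=True, random_state=1)
for trainIndices, testIndices in kf.split(x_train, y_train):
    # Build a fresh model for every fold so no weights carry over
    model = Sequential()
    ...
    history = model.fit(x_train[trainIndices], y_train[trainIndices],
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.2)

    # predict() returns per-class scores; argmax turns them into class labels
    prediction = np.argmax(model.predict(x_train[testIndices]), axis=1)
    # Compare against the held-out fold (the labels are one-hot, hence the argmax)
    true_labels = np.argmax(y_train[testIndices], axis=1)
    accuracy.append(accuracy_score(true_labels, prediction))

# Print the mean accuracy over all folds
print(sum(accuracy)/len(accuracy))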

How it works

Look here if you need more information about the technique itself.

As for the scikit-learn implementation, kf.split yields k pairs of index arrays: one with the training indices and one with the test indices. Because it yields values, you can iterate over it like a list. Remember that it gives you indices, not data, so to train a model you have to fetch the values yourself, e.g. x_train[trainIndices].
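To make the index behaviour concrete, here is a small sketch on toy data. The arrays and variable names are placeholders for illustration, not the dataset from the question.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 6 samples with 2 features each (placeholder values)
x = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])

kf = KFold(n_splits=3, shuffle=True, random_state=1)

folds = []
for train_indices, test_indices in kf.split(x, y):
    # split() yields integer index arrays, not the data itself,
    # so the actual rows are fetched by indexing into x and y
    x_fold_train, y_fold_train = x[train_indices], y[train_indices]
    x_fold_test, y_fold_test = x[test_indices], y[test_indices]
    folds.append((train_indices, test_indices))
    print("train:", train_indices, "test:", test_indices)
```

Every sample lands in the test part of exactly one fold, and in the training part of the others.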

For each one of those k models, you'll calculate the accuracy[*] of this trained model over the test partition. After that, you can calculate the mean accuracy to determine the overall accuracy of the model.


[*] For this I used accuracy_score from scikit-learn, but you could compute it manually instead.
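If you would rather compute the accuracy by hand, a minimal sketch could look like the following. The helper name manual_accuracy and the per-fold labels are invented for illustration; in practice the predictions would come from your trained models.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical (true labels, predicted labels) per fold; made-up values
fold_results = [
    ([0, 1, 1, 0], [0, 1, 0, 0]),  # 3 of 4 correct
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # 4 of 4 correct
]

fold_accuracies = [manual_accuracy(t, p) for t, p in fold_results]
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(fold_accuracies, mean_accuracy)
```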

Daniel

Do you want to do K-Fold for validation or for testing?

K-Fold is pretty simple (you could implement it yourself with Python's random module). It splits your input into k subsets and, for each fold, returns two arrays: the first (larger) holds the indices of the items in the other k-1 subsets, and the second (smaller) holds the indices of the k-th subset. How you use them is up to you. In other words, K-Fold gives you k different train/test (or train/validation) combinations to work with.
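As a rough illustration of the do-it-yourself route, here is a minimal sketch that uses only the standard library. The function name k_fold_indices and its parameters are made up for this example and are not part of scikit-learn.

```python
import random


def k_fold_indices(n_samples, k, seed=None):
    """Yield (train_indices, held_out_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


folds = list(k_fold_indices(n_samples=10, k=3, seed=1))
for train, held_out in folds:
    print("train:", train, "held out:", held_out)
```

Each index is held out exactly once, which is the same guarantee the library version gives you.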

With scikit-learn, the code looks something like this:

kfold = KFold(n_splits=k, shuffle=True, random_state=n)  # choose your own k and n
for arr1, arr2 in kfold.split(X, y):
  X_train, y_train = X[arr1], y[arr1]  # the k-1 training folds
  X_k, y_k = X[arr2], y[arr2]  # 'X_k' is your 'X_test' or 'X_valid', depending on your purpose
  # train your model

If you want to use the held-out fold for validation, the next lines inside the loop are:

  model = YourModel()
  model.fit(X_train, y_train, validation_data=(X_k, y_k), epochs=...)

Or, to use the held-out fold as a test set:

  model = YourModel()
  model.fit(X_train, y_train, epochs=...) # train without validation
  # model.fit(X_train, y_train, validation_split= ... ) # train with validation
  model.evaluate(X_k, y_k)  # evaluate() needs the labels as well as the inputs

Note that with K-Fold you train the model k times, once per fold.

nqt