
I've seen other similar questions and followed their solutions, with little improvement. I'm building a model to identify the gender of names. As training data I'm using the list of baby names found here: https://www.ssa.gov/oact/babynames/limits.html. I extracted this data into a new data frame, kept only one instance of each name that occurs more than once, and shuffled the rows randomly.
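
Roughly, the extraction looked like this (a sketch; it assumes one of the SSA yearly files, which are comma-separated name,gender,count rows without a header):

import pandas as pd

# Read one yearly file: each row is name,gender,count with no header
df = pd.read_csv("yob2019.txt", names=["name", "gender", "count"])

# Keep a single instance of each name and shuffle the rows
df = df.drop_duplicates(subset="name").sample(frac=1).reset_index(drop=True)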

Each name string in the column is converted to a numeric array of length max_len and normalized by this function:

import tensorflow as tf

def text_to_numeric(column, max_len):
    # Split each name into its individual characters
    word_characters = [list(word) for word in column]

    # The Tokenizer keeps only the most frequent characters (indices below
    # num_words); everything else maps to the OOV token
    letters_kept = 25
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=letters_kept, oov_token='<UNK>')
    tokenizer.fit_on_texts(word_characters)

    # Map characters to integer indices, then left-pad to a fixed length
    word_sequence = tokenizer.texts_to_sequences(word_characters)
    words_pre = tf.keras.preprocessing.sequence.pad_sequences(word_sequence, maxlen=max_len, padding="pre")
    words_pre = tf.keras.utils.normalize(words_pre)  # was normalize(input_data), which is undefined here

    return list(words_pre)
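
For example (the names and max_len here are just illustrative):

names = ["Mary", "John", "Alex"]
numeric_names = text_to_numeric(names, max_len=10)
# numeric_names is a list of 3 normalized vectors, each of length 10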

The expected output is an array of two-element lists, where [1, 0] means "Male" and [0, 1] means "Female". The model, where data_file contains the processed names and labels, looks like this:

input_length, input_data, output_data = data_reader(data_file)

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(100, input_dim=input_length, activation='relu'))
model.add(tf.keras.layers.Dense(100, activation='relu'))
model.add(tf.keras.layers.Dense(2, activation='softmax'))  # one output unit per gender class

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(input_data, output_data, epochs=30, verbose=1, validation_split=0.1)
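
For reference, the [1, 0] / [0, 1] labels could be built from 0/1 gender codes like this (a sketch, since data_reader's internals aren't shown here):

# 0 = Male, 1 = Female; to_categorical turns these into [1, 0] / [0, 1] rows
gender_codes = [0, 1, 1, 0]
output_data = tf.keras.utils.to_categorical(gender_codes, num_classes=2)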

No matter what, I always get an accuracy of around 75%. I don't know how to choose the model parameters, but I've tried many combinations and the accuracy changes little. So far I've tried:

  • normalizing the input
  • balancing the dataset so there are equal numbers of male and female names
  • changing the optimizer
  • defining an optimizer explicitly and changing its learning rate (sketched below)
  • changing the number of layers, nodes per layer, and activation functions
  • increasing the number of epochs
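
The learning-rate attempts were along these lines (the 0.001 is just a placeholder for the values I tried):

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])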

All of this with no significant change in the model's accuracy. Am I missing something or doing something completely wrong? Is this accuracy as good as it gets?

Chegon
  • Technically, when using `Dense(2, activation='softmax')` as the last layer, you should use `loss='categorical_crossentropy'`. The alternative is to keep `loss='binary_crossentropy'` but change your last layer to `Dense(1, activation='sigmoid')`. As is, it's not certain that you are actually getting the correct accuracy - see [this](https://stackoverflow.com/questions/42081257/why-binary-crossentropy-and-categorical-crossentropy-give-different-performances) and [this](https://stackoverflow.com/questions/41327601) threads (a sketch of both setups follows these comments). – desertnaut Jun 04 '20 at 17:39
  • Thanks for the info, although it does not change the accuracy much. – Chegon Jun 04 '20 at 19:45
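
A minimal sketch of the two setups described in the first comment (everything else unchanged):

# Option 1: keep the 2-unit softmax output, switch the loss
model.add(tf.keras.layers.Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Option 2: keep binary_crossentropy, use a single sigmoid unit
# (labels then become a single 0/1 column instead of [1,0]/[0,1] pairs)
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])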

0 Answers