
I have created a simple machine learning model to predict the product of two given numbers. I followed a YouTube tutorial to learn the basics and tried to apply them to this simple idea.

My model has three dense layers: input, hidden, and output. The input and hidden layers originally used the same activation function, 'relu', which gave me NaN as the loss during model.fit, so I changed one of them to sigmoid, which then started giving me something like 0.00000e+... as the loss.

I don't know what is wrong. Can anyone please point out what I am doing or assuming incorrectly?

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 

df = pd.read_csv('data.csv')
print(df)

x = np.array(df['X'])
y = np.array(df['Y'])
s = np.array(df['S'])

def build_model():
    model = keras.Sequential()
    inputLayer = layers.Dense(64, activation='sigmoid', input_shape=[2])
    hiddenLayer = layers.Dense(64, activation='relu')
    outputLayer = layers.Dense(1)
    model.add(inputLayer)
    model.add(hiddenLayer)
    model.add(outputLayer)
    model.compile(optimizer='sgd', loss='mean_squared_error',metrics=['accuracy'])
    return model

model = build_model() 
print(model.summary())

EPOCHS = 1000

# I didn't know how to provide multiple inputs to my model for
# training, so I checked Stack Overflow here:
# https://stackoverflow.com/questions/55233377/keras-sequential-model-with-multiple-inputs?noredirect=1&lq=1
 
merged_array = np.stack([x, y], axis=1)


history = model.fit(merged_array, s, epochs=EPOCHS, validation_split = 0.2, verbose=2)
print(history)

print(model.predict([[2,3],]))

Disclaimer: I am a beginner, and this is the first time in my life I am using Keras and Python.


2 Answers


It does work for smaller numbers with ReLU activation.

from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

x = np.random.randint(0, 10, 1000)
y = np.random.randint(0, 10, 1000)
s = x*y


def build_model():
    model = keras.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=[2]))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer=keras.optimizers.Adam(lr=0.01),
                  loss='mean_squared_error')
    return model


model = build_model()

merged_array = np.stack([x, y], axis=1)

history = model.fit(merged_array, s, epochs=250,
                    validation_split=0.2)

test_input = [2, 3]

print('\n{} x {} ='.format(*test_input),
      np.round(model.predict([test_input])[0][0]).astype(int))
Output: 2 x 3 = 6

SGD also works, but it requires standardization/normalization of the inputs, which kind of defeats the purpose of your task, so I changed the optimizer above. With scaled inputs it works too:

from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

x = np.random.randint(0, 10, 1000)
y = np.random.randint(0, 10, 1000)
s = x*y
x = x/10
y = y/10


def build_model():
    model = keras.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=[2]))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer=keras.optimizers.SGD(0.001), loss='mean_squared_error')
    return model


model = build_model()

merged_array = np.stack([x, y], axis=1)

history = model.fit(merged_array, s, epochs=250,
                    validation_split=0.2, batch_size=16)

test_input = [2/10, 3/10]

print('\n{} x {} ='.format(*map(lambda l: int(l*10), test_input)),
      np.round(model.predict([test_input])[0][0]).astype(int))
Nicolas Gervais
  • Gervais, thank you for the code. Sorry, I didn't check before, and after modifying my code I found that we had written similar code. Anyway, adding some explanation would be really helpful, such as why you chose Adam instead of sgd, and why sgd gives NaN but Adam works fine. I found Adam by hit and trial from the Keras list of optimizers and checking the results. – B L Λ C K Sep 23 '20 at 02:37
  • SGD also converges but slower, namely because your features need to be standardized/normalized. See edit. – Nicolas Gervais Sep 23 '20 at 12:34

I noticed a couple of issues with your model:

  1. Your "input layer" is not actually a separate input. You do not need a designated input layer in this case; the argument input_shape=[2] is sufficient to add a proper input layer before this Dense layer.

  2. You do not specify any batch size in the fit function. Batches are usually a small subset of your training and validation set (commonly powers of two such as 4, 8, 16, 32, ...). During training, your weights are not backpropagated and adjusted (aka "learning") one sample at a time but in batches, which makes training faster. Since your input data are just two floating-point numbers per sample (I assume), you can choose a really high batch size like 1024 or higher. The batch size belongs to the so-called hyperparameters, which affect your overall training success.

history = model.fit(merged_array, s, batch_size=1024, epochs=EPOCHS, validation_split=0.2, verbose=2)
  3. During training you track the "accuracy" metric. As you are working on a regression problem, this does not help you estimate your model's performance (accuracy is used for classification problems). You can leave it out; a combined sketch of these fixes follows below.
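
Putting these points together, a minimal sketch could look like the code below. It is only an illustration under assumptions: synthetic integer data stands in for your data.csv, Adam is used instead of plain SGD (as in the other answer, since unnormalized products up to 81 can make SGD diverge to NaN), and the batch size of 1024 and 1000 epochs are example values rather than tuned hyperparameters.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in for data.csv: two random factors and their product
x = np.random.randint(0, 10, 1000)
y = np.random.randint(0, 10, 1000)
s = x * y
merged_array = np.stack([x, y], axis=1)

def build_model():
    model = keras.Sequential()
    # Point 1: no designated input layer; input_shape on the first Dense layer is enough
    model.add(layers.Dense(64, activation='relu', input_shape=[2]))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    # Point 3: no 'accuracy' metric, since it is meaningless for regression
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

model = build_model()

# Point 2: pass an explicit batch_size to fit()
history = model.fit(merged_array, s, batch_size=1024, epochs=1000,
                    validation_split=0.2, verbose=2)

print(model.predict(np.array([[2, 3]])))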

I cannot give you more specific advice without knowledge of the data you are using: how many data points you have and what kind of numbers you want to multiply (bounded to numbers between 0 and 10, floats or integers, ...).

Hope this helps so far (;

tobi.tobt
  • Awesome reply. Let me take in all of your suggestions and get back to you. About the size of the data: I am training with 1000 data points. – B L Λ C K Sep 22 '20 at 14:52
  • I have written similar code to what Nicolas Gervais provided, which kind of works with 5000 epochs and a batch size of 1024, given the Adam optimizer. Adam was my hit-and-trial thing; I didn't get why it worked with that but not with sgd. – B L Λ C K Sep 23 '20 at 02:40
  • I am not very sure, but I heard that the higher the batch size is, the better the gradient towards your optimal parameter point is estimated. This might be the case here as well. Usually (especially when working with images) your GPU memory limits the batch size. – tobi.tobt Sep 23 '20 at 06:28