
I have this dataframe.

I am trying to follow this example.

The target value I want to predict is zg500. The other feature I want to use is tas.

I want to create the feature columns, combining the latitudes and longitudes:

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column

df = pd.read_csv('./df.csv')
# if an unnamed index column exists
#df.drop(['Unnamed: 0'],
#          axis=1,
#          inplace=True)

df.dropna(inplace=True)

# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('zg500')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

batch_size = 16 
train_ds = df_to_dataset(df, batch_size=batch_size)
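As a quick sanity check (assuming the CSV really contains the tas, lats, lons and zg500 columns), you can pull one batch and confirm that the features arrive as a dict of tensors keyed by column name:

# Peek at one batch: features come out as a dict of column-name -> tensor,
# which is the structure the feature columns below expect.
example_features, example_labels = next(iter(train_ds))
print(list(example_features.keys()))
print(example_labels.numpy()[:5])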

feature_columns = []
tas = feature_column.numeric_column("tas")
latitude = feature_column.numeric_column("lats")
longitude = feature_column.numeric_column("lons")
bucketized_lat = feature_column.bucketized_column(latitude, boundaries=[0, 20, 40, 70])
bucketized_lon = feature_column.bucketized_column(longitude, boundaries=[-45, -20, 0, 20, 60])

feature_columns.append(tas)
feature_columns.append(bucketized_lat)
feature_columns.append(bucketized_lon)
lat_lon = feature_column.crossed_column([bucketized_lat, bucketized_lon], 1000)
lat_lon = feature_column.indicator_column(lat_lon)
feature_columns.append(lat_lon)

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
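To see what these columns produce, you can call the feature layer on an example batch, as in the TensorFlow structured-data tutorial (a sketch; the output width follows from the sizes chosen above):

example_batch = next(iter(train_ds))[0]
# tas stays a single numeric value, the bucketized columns become one-hot
# vectors (5 and 6 buckets), and the crossed column is a 1000-wide
# indicator, so each row should flatten to 1 + 5 + 6 + 1000 = 1012 values.
print(feature_layer(example_batch).shape)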

Create the model:

model = tf.keras.Sequential([
  feature_layer,
  tf.keras.layers.Dense(10, activation='relu'),
  tf.keras.layers.Dense(1)
])


model.compile(optimizer='adam',
              loss='mse')

 
history = model.fit(train_ds, epochs=2)

Right now, I am getting a nan loss:

10918/10918 [==============================] - 10s 861us/step - loss: nan
Epoch 2/2
10918/10918 [==============================] - 10s 857us/step - loss: nan

Also, I was wondering why fitting on the df dataframe directly, instead of train_ds:

history = model.fit(df.iloc[:, [0, 2, 3]].values,
                    df.iloc[:, 1].values,
                    epochs=2)

produces:

  ValueError: ('We expected a dictionary here. Instead we got: ', <tf.Tensor 'IteratorGetNext:0' shape=(32, 3) dtype=float32>)
George
  • You can trace the origin of the first NaN value that appears. It is computed from other values which you can inspect. For example, if the loss is not NaN before the first training step, and then becomes NaN, this is often due to a too high learning rate. See also https://stackoverflow.com/questions/40050397/deep-learning-nan-loss-reasons and similar posts. – root Jul 31 '21 at 13:55
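Following that comment, one quick way to fail fast on the first NaN and to test the learning-rate hypothesis (a sketch; the answer below addresses the actual cause) is:

# Stop training as soon as the loss turns NaN, and try a smaller
# learning rate than Adam's default 1e-3 to rule out divergence.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss='mse')
history = model.fit(train_ds,
                    epochs=2,
                    callbacks=[tf.keras.callbacks.TerminateOnNaN()])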

1 Answer


The reason you are getting nan in the loss is that your target values are extreme: they range in magnitude from about 1e-32 up to 1e+31, as you can see by inspecting the column.

df['zg500']
'''
0      -3.996248e-29
1       2.476790e+11
2      -1.010202e+08
3      -1.407987e-02
4       2.240596e-32
            ...     
1742   -1.682389e+11
1743   -4.802401e+00
1744   -3.480795e+31
1745    1.026754e+21
1746    1.790822e+23
Name: zg500, Length: 1739, dtype: float64
'''

The workaround is to scale the target. This is generally not recommended, but here we have little choice. Below is a slight modification that uses sklearn's StandardScaler to scale the targets.

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    # Pop the target so it is removed from the feature dict, then scale it.
    labels = ss.fit_transform(dataframe.pop('zg500').values.reshape(-1, 1))
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
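Since the model is now trained on scaled targets, its predictions come back in standardized units; to report values on the original zg500 scale, invert the scaler (a sketch using the ss fitted above):

# Use shuffle=False so predictions stay aligned with the dataframe rows,
# then map the standardized outputs back to the original target scale.
eval_ds = df_to_dataset(df, shuffle=False, batch_size=batch_size)
preds = ss.inverse_transform(model.predict(eval_ds))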

After scaling the targets, below are the results of training the model:

history = model.fit(train_ds, epochs=10)
'''
Consider rewriting this model with the Functional API.
109/109 [==============================] - 1s 804us/step - loss: 27.0520
Epoch 2/10
109/109 [==============================] - 0s 769us/step - loss: 1.0166
Epoch 3/10
109/109 [==============================] - 0s 753us/step - loss: 1.0148
Epoch 4/10
109/109 [==============================] - 0s 779us/step - loss: 1.0115
Epoch 5/10
109/109 [==============================] - 0s 775us/step - loss: 1.0107
Epoch 6/10
109/109 [==============================] - 0s 915us/step - loss: 1.0107
Epoch 7/10
109/109 [==============================] - 0s 1ms/step - loss: 1.0034
Epoch 8/10
109/109 [==============================] - 0s 784us/step - loss: 1.0092
Epoch 9/10
109/109 [==============================] - 0s 735us/step - loss: 1.0151
Epoch 10/10
109/109 [==============================] - 0s 803us/step - loss: 1.0105
'''
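As for the second question: the DenseFeatures layer looks its inputs up by column name, so model.fit needs a dict keyed by feature name rather than a bare NumPy array. A minimal sketch, with the column names assumed from the feature definitions in the question:

# Pass a dict of arrays keyed by the feature-column names instead of a
# positional 2-D array; keep the targets scaled for consistency.
features = {name: df[name].values for name in ['tas', 'lats', 'lons']}
targets = ss.transform(df['zg500'].values.reshape(-1, 1))
history = model.fit(features, targets, epochs=2)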
Abhishek Prajapat