I am trying to write a custom categorical loss function that uses a single y_true and y_pred for multiple records sharing the same email address, based on the following article: custom_loss_in_keras. The input data and preparation code look as follows:
Table
email | feature1 | feature2 | feature3 | y_true | y_pred
---|---|---|---|---|---
a@email_com | 0.23 | 2.44 | 3.1 | onehot | onehot |
a@email_com | 1.21 | -2.1 | 2.1 | onehot | onehot |
a@email_com | 0.5 | -1.1 | 2.5 | onehot | onehot |
... | ... | ... | ... | ... | ... |
zzz@email_com | 2.334 | 2.5 | 4.4 | onehot | onehot |
zzz@email_com | 3.25 | 3.6 | 4.2 | onehot | onehot |
zzz@email_com | 2.85 | 2.97 | 4.3 | onehot | onehot |
Note 1: onehot is either [0 1] or [1 0].
Note 2: each feature is of length 576, but I only show its first element for demonstration purposes.
Code
import numpy as np
from sklearn.preprocessing import LabelEncoder

x_train, x_train_email, y_train = get_train_data()
# encode each email address as an integer id and append it to the one-hot
# labels, so the grouping can be recovered from y_true later
x_train_email_le = LabelEncoder()
x_train_email_enc = x_train_email_le.fit_transform(x_train_email).astype(np.float64)
y_train_email_concat = np.column_stack((y_train, x_train_email_enc[:, None]))
# x_train: (n_samples, feature_length, n_features)
print(x_train.shape, x_train_email.shape, y_train.shape, y_train_email_concat.shape)
# (15692, 576, 3) (15692,) (15692, 2) (15692, 3)
The categorical loss I am trying to implement is an ensemble loss over all records that share the same email address. For instance, the training dataset contains 86 a@email_com and 28 zzz@email_com records, and the most frequent y_pred for a@email_com and zzz@email_com is [0 1] and [1 0], respectively. The original categorical_crossentropy in keras treats those 86 and 28 y_pred/y_true pairs as separate samples, whereas I want a single y_pred and y_true for a@email_com and a single pair for zzz@email_com.
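To make the intended reduction concrete, here is a small NumPy/Counter sketch of the semantics I am after (toy values, not my real data):

import numpy as np
from collections import Counter

emails = np.array([0, 0, 0, 1, 1])      # encoded emails: 3 records for id 0, 2 for id 1
y_pred_inv = np.array([1, 1, 0, 0, 0])  # per-record predicted class (argmax of the onehot)
y_true_inv = np.array([1, 1, 1, 0, 0])  # per-record true class (identical within a group)

ens_y_true, ens_y_pred = [], []
for email in np.unique(emails):
    mask = emails == email
    ens_y_true.append(y_true_inv[mask][0])                             # one label per email
    ens_y_pred.append(Counter(y_pred_inv[mask]).most_common(1)[0][0])  # majority vote

print(ens_y_true, ens_y_pred)  # one y_true/y_pred pair per email, not per record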
Initially, I wrote a custom callback, which works as expected but is far too slow, so I can only afford to run it at the end of every epoch rather than on every batch.
from collections import Counter

import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import Callback


class EnsembleEvaluator(Callback):
    def __init__(self, train_data=(), validation_data=()):
        super(EnsembleEvaluator, self).__init__()
        self.value_dict = {}
        self.train_value_dict = {}
        self.x_train, self.x_train_email, self.y_train = train_data
        self.x_val, self.x_val_email, self.y_val = validation_data
        self.CCE = tf.keras.losses.CategoricalCrossentropy()
        self.ACC = tf.keras.metrics.Accuracy()
        self.AUC = tf.keras.metrics.AUC()

    def get_acc_loss_auc(self, model, x, x_email, y_true_dummy):
        y_pred_dummy = model.predict(x, verbose=0)
        # columns 0:2 hold the one-hot labels; column 2 holds the encoded email
        y_true = y_true_dummy[:, 0:2]
        y_pred = y_pred_dummy[:, 0:2]
        y_pred_inv = np.argmax(y_pred, axis=-1)
        y_true_inv = np.argmax(y_true, axis=-1)
        # group the hard per-record predictions by email address
        ensemble_dict = {}
        for sample_id in range(x_email.shape[0]):
            y_val_p = y_pred_inv[sample_id]
            y_val_t = y_true_inv[sample_id]
            email = x_email[sample_id]
            if email not in ensemble_dict:
                ensemble_dict[email] = {"y_true": -1, "y_preds": []}
            ensemble_dict[email]["y_preds"].append(y_val_p)
            ensemble_dict[email]["y_true"] = y_val_t
        # reduce each group to a single majority-vote prediction
        ens_y_true = []
        ens_y_pred = []
        for email in ensemble_dict.keys():
            v_dict = ensemble_dict[email]
            tmp_y_true = v_dict["y_true"]
            tmp_y_pred, tmp_y_pred_occ = Counter(v_dict["y_preds"]).most_common(1)[0]
            ens_y_true.append(tmp_y_true)
            ens_y_pred.append(tmp_y_pred)
        # reset the stateful metrics so results do not accumulate across epochs
        self.ACC.reset_state()
        self.AUC.reset_state()
        self.ACC.update_state(ens_y_true, ens_y_pred)
        self.AUC.update_state(ens_y_true, ens_y_pred)
        acc = self.ACC.result().numpy()
        auc = self.AUC.result().numpy()
        ens_y_true = np.array(ens_y_true, dtype=np.float32)
        ens_y_pred = np.array(ens_y_pred, dtype=np.float32)
        loss = self.CCE(ens_y_true, ens_y_pred).numpy()
        return acc, auc, loss, ens_y_true, ens_y_pred

    def on_epoch_end(self, epoch, logs=None):
        acc, auc, loss, ens_y_true, ens_y_pred = self.get_acc_loss_auc(
            self.model, self.x_val, self.x_val_email, self.y_val)
        self.value_dict["val_acc"] = acc
        self.value_dict["val_loss"] = loss
        self.value_dict["val_auc"] = auc
        logs.update(self.value_dict)
I have run the above callback as follows:
ee = EnsembleEvaluator(train_data=(x_train, x_train_email, y_train_email_concat), validation_data=(x_val, x_val_email, y_val_email_concat))
#...
history_obj = model.fit(x_train, y_train, epochs=n_epochs, validation_data=(x_val, y_val), class_weight=class_weights, batch_size=batch_size, callbacks=[ee], verbose=1)
I found that I should write a custom loss function instead of using a callback for faster training. I suspect the slowness comes from host-to-GPU interaction: the original keras loss runs on the GPU, while the code above runs on the host CPU. The logic above should therefore be expressed with TensorFlow ops, but I do not know how to write the same logic that way.
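For concreteness, the only rough idea I have is the untested sketch below. It reads the encoded email out of the concatenated y_true (as prepared in y_train_email_concat above) and replaces the non-differentiable Counter majority vote with a per-email mean of the predicted probabilities via tf.math.unsorted_segment_mean; I am not sure this is correct or equivalent:

import tensorflow as tf

def ensemble_cce(y_true_concat, y_pred):
    # columns 0:2 are the one-hot labels; column 2 is the LabelEncoder id
    y_true = y_true_concat[:, 0:2]
    group_ids = tf.cast(y_true_concat[:, 2], tf.int32)
    # remap the ids so they are contiguous within the current batch
    unique_ids, segment_ids = tf.unique(group_ids)
    n_groups = tf.size(unique_ids)
    # mean label/probability per email, a soft stand-in for the majority vote
    ens_true = tf.math.unsorted_segment_mean(y_true, segment_ids, n_groups)
    ens_pred = tf.math.unsorted_segment_mean(y_pred, segment_ids, n_groups)
    # one cross-entropy term per email instead of per record
    return tf.keras.losses.categorical_crossentropy(ens_true, ens_pred)

If something like this is viable, I would compile with loss=ensemble_cce and fit on y_train_email_concat instead of y_train, but I do not know whether this is the right approach or how class_weight would interact with the per-group reduction.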
I would really appreciate it if you could help me write a custom category_cross_entropy class or function whose logic is identical to EnsembleEvaluator.get_acc_loss_auc(), and which ideally still allows the class_weight parameter.
Thanks in advance.