
I am trying to write a custom categorical loss function that uses a single y_true and y_pred for multiple records sharing the same email address, based on the following article: custom_loss_in_keras. The input data and the preparation code look as follows:

Table

email          feature1  feature2  feature3  y_true  y_pred
a@email.com    0.23      2.44      3.1       onehot  onehot
a@email.com    1.21      -2.1      2.1       onehot  onehot
a@email.com    0.5       -1.1      2.5       onehot  onehot
...            ...       ...       ...       ...     ...
zzz@email.com  2.334     2.5       4.4       onehot  onehot
zzz@email.com  3.25      3.6       4.2       onehot  onehot
zzz@email.com  2.85      2.97      4.3       onehot  onehot

Note 1: onehot is either [0 1] or [1 0].

Note 2: each feature is of length 576, but I only show the first element of each feature for demonstration purposes.

Code

import numpy as np
from sklearn.preprocessing import LabelEncoder

x_train, x_train_email, y_train = get_train_data()
# Encode the email addresses as integer ids and append them as an extra
# column of y_train, so the records can later be grouped by email.
x_train_email_le = LabelEncoder()
x_train_email_enc = x_train_email_le.fit_transform(x_train_email).astype(np.float64)
y_train_email_concat = np.column_stack((y_train, x_train_email_enc[:, None]))
# x_train: (n_samples, feature_length, n_features)
print(x_train.shape, x_train_email.shape, y_train.shape, y_train_email_concat.shape)
# (15692, 576, 3) (15692,) (15692, 2) (15692, 3)

The categorical loss I am trying to implement is an ensemble loss over all records with the same email address. For instance, the training dataset contains 86 a@email.com records and 28 zzz@email.com records. The most frequent y_pred is [0 1] for a@email.com and [1 0] for zzz@email.com. The original categorical_crossentropy in Keras treats the 86 and 28 (y_true, y_pred) pairs separately, whereas I want a single y_pred and y_true per email address.
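
To make the aggregation concrete, here is a toy example of the majority vote I have in mind (the values and variable names are made up for illustration):

from collections import Counter

import numpy as np

# Three records for a@email.com, two for zzz@email.com.
emails = np.array(["a@email.com"] * 3 + ["zzz@email.com"] * 2)
y_pred_cls = np.array([1, 1, 0, 0, 0])  # argmax of each record's y_pred

# Majority vote: a single predicted class per email address.
for email in np.unique(emails):
    votes = y_pred_cls[emails == email]
    majority, n_votes = Counter(votes.tolist()).most_common(1)[0]
    print(email, "->", majority)  # a@email.com -> 1, zzz@email.com -> 0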

Initially, I wrote a custom callback, which works as expected but is far too slow. As a result, I can only run it at the end of each epoch, not on every batch.

from collections import Counter

import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import Callback


class EnsembleEvaluator(Callback):
    def __init__(self, train_data=(), validation_data=()):
        super(EnsembleEvaluator, self).__init__()
        self.value_dict = {}
        self.train_value_dict = {}
        self.x_train, self.x_train_email, self.y_train = train_data
        self.x_val, self.x_val_email, self.y_val = validation_data
        self.CCE = tf.keras.losses.CategoricalCrossentropy()
        self.ACC = tf.keras.metrics.Accuracy()
        self.AUC = tf.keras.metrics.AUC()
    
    def get_acc_loss_auc(self, model, x, x_email, y_true_dummy):
        y_pred_dummy = model.predict(x, verbose=0)
        # Drop the appended email-id column; keep only the one-hot labels.
        y_true = y_true_dummy[:, 0:2]
        y_pred = y_pred_dummy[:, 0:2]

        y_pred_inv = np.argmax(y_pred, axis=-1)
        y_true_inv = np.argmax(y_true, axis=-1)
        # Collect the predicted classes of all records per email address.
        ensemble_dict = {}
        for sample_id in range(x_email.shape[0]):
            y_val_p = y_pred_inv[sample_id]
            y_val_t = y_true_inv[sample_id]
            email = x_email[sample_id]
            if email not in ensemble_dict:
                ensemble_dict[email] = {"y_true": -1, "y_preds": []}
            ensemble_dict[email]["y_preds"].append(y_val_p)
            # All records of an email share the same true label.
            ensemble_dict[email]["y_true"] = y_val_t

        ens_y_true = []
        ens_y_pred = []

        # Majority vote: keep the most frequent predicted class per email.
        for email in ensemble_dict.keys():
            v_dict = ensemble_dict[email]
            tmp_y_true = v_dict["y_true"]
            tmp_y_pred, tmp_y_pred_occ = Counter(v_dict["y_preds"]).most_common(1)[0]
            ens_y_true.append(tmp_y_true)
            ens_y_pred.append(tmp_y_pred)

        # Reset the stateful metrics so results do not accumulate across epochs.
        self.ACC.reset_state()
        self.AUC.reset_state()
        self.ACC.update_state(ens_y_true, ens_y_pred)
        self.AUC.update_state(ens_y_true, ens_y_pred)
        acc = self.ACC.result().numpy()
        auc = self.AUC.result().numpy()
        # Convert the majority-vote class indices back to one-hot vectors
        # before computing the categorical cross-entropy.
        ens_y_true = np.eye(2, dtype=np.float32)[ens_y_true]
        ens_y_pred = np.eye(2, dtype=np.float32)[ens_y_pred]
        loss = self.CCE(ens_y_true, ens_y_pred).numpy()
        return acc, auc, loss, ens_y_true, ens_y_pred
    
    def on_epoch_end(self, epoch, logs=None):
        acc, auc, loss, ens_y_true, ens_y_pred = self.get_acc_loss_auc(
            self.model, self.x_val, self.x_val_email, self.y_val)
        self.value_dict["val_acc"] = acc
        self.value_dict["val_loss"] = loss
        self.value_dict["val_auc"] = auc
        logs.update(self.value_dict)

I have run the above callback as follows:

ee = EnsembleEvaluator(train_data=(x_train, x_train_email, y_train_email_concat),
                       validation_data=(x_val, x_val_email, y_val_email_concat))
# ...
history_obj = model.fit(x_train, y_train, epochs=n_epochs,
                        validation_data=(x_val, y_val),
                        class_weight=class_weights, batch_size=batch_size,
                        callbacks=[ee], verbose=1)

I have found that I should write a custom loss function instead of using a callback for faster training. I suspect the slowness comes from host-to-GPU interaction: the original Keras code runs on the GPU, while the code above runs on the host CPU. The logic above should therefore be written with TensorFlow operations, but I do not know how to express the same logic that way.
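
For reference, below is the direction I have been exploring: an untested sketch that groups records by the email id appended as the third column of y_true (as in y_train_email_concat). Note that a hard majority vote has no gradient, so this sketch averages the predicted probabilities per email instead, which only approximates the logic above:

import tensorflow as tf

def group_by_email(y_true_with_id, y_pred):
    # Drop the appended email-id column; keep only the one-hot labels.
    y_true = y_true_with_id[:, 0:2]
    email_ids = tf.cast(y_true_with_id[:, 2], tf.int32)
    # Map the arbitrary ids in this batch to contiguous segment indices.
    unique_ids, segment_idx = tf.unique(email_ids)
    n_groups = tf.size(unique_ids)
    # One row per email: labels are identical within a group, and the mean
    # of the predicted probabilities stands in for the majority vote.
    grp_true = tf.math.unsorted_segment_mean(y_true, segment_idx, n_groups)
    grp_pred = tf.math.unsorted_segment_mean(y_pred, segment_idx, n_groups)
    return grp_true, grp_pred

def ensemble_categorical_crossentropy(y_true_with_id, y_pred):
    grp_true, grp_pred = group_by_email(y_true_with_id, y_pred)
    return tf.keras.losses.categorical_crossentropy(grp_true, grp_pred)

# Usage sketch: train on the labels with the email-id column attached.
# model.compile(optimizer="adam", loss=ensemble_categorical_crossentropy)
# model.fit(x_train, y_train_email_concat, ...)

One caveat I am aware of: this only groups records that happen to land in the same batch, so the batches would have to be arranged so that all records of an email stay together.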

I would really appreciate it if you could help me write a custom categorical cross-entropy class or function with logic identical to EnsembleEvaluator.get_acc_loss_auc(), ideally one that still allows the class_weight parameter to be used.
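
Regarding class_weight: since Keras applies class_weight per record rather than per email group, the only workaround I can think of (again untested, reusing group_by_email from the sketch above, with made-up weight values) is to fold the weights directly into the loss:

# Illustrative per-class weights; index = class id.
CLASS_WEIGHTS = tf.constant([1.0, 2.5])

def weighted_ensemble_cce(y_true_with_id, y_pred):
    grp_true, grp_pred = group_by_email(y_true_with_id, y_pred)
    # Weight each per-email loss by the weight of its true class.
    weights = tf.gather(CLASS_WEIGHTS, tf.argmax(grp_true, axis=-1))
    return tf.keras.losses.categorical_crossentropy(grp_true, grp_pred) * weights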

Thanks in advance.
