How to implement hierarchical Transformer for document classification in Keras?

Question

A hierarchical attention mechanism for document classification was presented by Yang et al.: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf

An implementation is available at https://github.com/ShawnyXiao/TextClassification-Keras

An implementation of document classification with a Transformer is also available at https://keras.io/examples/nlp/text_classification_with_transformer

However, it is not hierarchical.

I have googled a lot but didn't find any implementation of a hierarchical Transformer. Does anyone know how to implement a hierarchical transformer for document classification in Keras?

My implementation is as follows. Note that it extends Nandan's implementation of document classification: https://keras.io/examples/nlp/text_classification_with_transformer

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import to_categorical


class MultiHeadSelfAttention(layers.Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        if embed_dim % num_heads != 0:
            raise ValueError(
                f"embedding dimension = {embed_dim} should be divisible by number of heads = {num_heads}"
            )
        self.projection_dim = embed_dim // num_heads
        self.query_dense = layers.Dense(embed_dim)
        self.key_dense = layers.Dense(embed_dim)
        self.value_dense = layers.Dense(embed_dim)
        self.combine_heads = layers.Dense(embed_dim)

    def attention(self, query, key, value):
        score = tf.matmul(query, key, transpose_b=True)
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
        scaled_score = score / tf.math.sqrt(dim_key)
        weights = tf.nn.softmax(scaled_score, axis=-1)
        output = tf.matmul(weights, value)
        return output, weights

    def separate_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        # inputs.shape = [batch_size, seq_len, embed_dim]
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs)  # (batch_size, seq_len, embed_dim)
        key = self.key_dense(inputs)  # (batch_size, seq_len, embed_dim)
        value = self.value_dense(inputs)  # (batch_size, seq_len, embed_dim)
        query = self.separate_heads(
            query, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        key = self.separate_heads(
            key, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        value = self.separate_heads(
            value, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        attention, weights = self.attention(query, key, value)
        attention = tf.transpose(
            attention, perm=[0, 2, 1, 3]
        )  # (batch_size, seq_len, num_heads, projection_dim)
        concat_attention = tf.reshape(
            attention, (batch_size, -1, self.embed_dim)
        )  # (batch_size, seq_len, embed_dim)
        output = self.combine_heads(
            concat_attention
        )  # (batch_size, seq_len, embed_dim)
        return output

    def compute_output_shape(self, input_shape):
        # it does not change the shape of its input
        return input_shape


class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout_rate, name=None):
        super(TransformerBlock, self).__init__(name=name)
        self.att = MultiHeadSelfAttention(embed_dim, num_heads)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim), ]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

    def compute_output_shape(self, input_shape):
        # it does not change the shape of its input
        return input_shape


class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim, name=None):
        super(TokenAndPositionEmbedding, self).__init__(name=name)
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

    def compute_output_shape(self, input_shape):
        # it changes the shape from (batch_size, maxlen) to (batch_size, maxlen, embed_dim)
        return input_shape + (self.pos_emb.output_dim,)



# Lower level (produce a representation of each sentence):

embed_dim = 100  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 64  # Hidden layer size in feed forward network inside transformer
L1_dense_units = 100  # Size of the sentence-level representations output by the word-level model
dropout_rate = 0.1
vocab_size = 1000
class_number = 5
max_docs = 10000
max_sentences = 15
max_words = 60

word_input = layers.Input(shape=(max_words,), name='word_input')
word_embedding = TokenAndPositionEmbedding(maxlen=max_words, vocab_size=vocab_size,
                                           embed_dim=embed_dim, name='word_embedding')(word_input)
word_transformer = TransformerBlock(embed_dim=embed_dim, num_heads=num_heads, ff_dim=ff_dim,
                                    dropout_rate=dropout_rate, name='word_transformer')(word_embedding)
word_pool = layers.GlobalAveragePooling1D(name='word_pooling')(word_transformer)
word_drop = layers.Dropout(dropout_rate, name='word_drop')(word_pool)
word_dense = layers.Dense(L1_dense_units, activation="relu", name='word_dense')(word_drop)
word_encoder = keras.Model(word_input, word_dense)

word_encoder.summary()

# =========================================================================
# Upper level (produce a representation of each document):

L2_dense_units = 100

sentence_input = layers.Input(shape=(max_sentences, max_words), name='sentence_input')

sentence_encoder = tf.keras.layers.TimeDistributed(word_encoder, name='sentence_encoder')(sentence_input)

sentence_transformer = TransformerBlock(embed_dim=L1_dense_units, num_heads=num_heads, ff_dim=ff_dim,
                               dropout_rate=dropout_rate, name='sentence_transformer')(sentence_encoder)
sentence_pool = layers.GlobalAveragePooling1D(name='sentence_pooling')(sentence_transformer)
sentence_out = layers.Dropout(dropout_rate)(sentence_pool)
preds = layers.Dense(class_number, activation='softmax', name='sentence_output')(sentence_out)

model = keras.Model(sentence_input, preds)
model.summary()

The summary of the model is as follows:

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 word_input (InputLayer)     [(None, 60)]              0         
                                                                 
 word_embedding (TokenAndPos  (None, 60, 100)          106000    
 itionEmbedding)                                                 
                                                                 
 word_transformer (Transform  (None, 60, 100)          53764     
 erBlock)                                                        
                                                                 
 word_pooling (GlobalAverage  (None, 100)              0         
 Pooling1D)                                                      
                                                                 
 word_drop (Dropout)         (None, 100)               0         
                                                                 
 word_dense (Dense)          (None, 100)               10100     
                                                                 
=================================================================
Total params: 169,864
Trainable params: 169,864
Non-trainable params: 0
_________________________________________________________________
Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 sentence_input (InputLayer)  [(None, 15, 60)]         0         
                                                                 
 sentence_encoder (TimeDistr  (None, 15, 100)          169864    
 ibuted)                                                         
                                                                 
 sentence_transformer (Trans  (None, 15, 100)          53764     
 formerBlock)                                                    
                                                                 
 sentence_pooling (GlobalAve  (None, 100)              0         
 ragePooling1D)                                                  
                                                                 
 dropout_9 (Dropout)         (None, 100)               0         
                                                                 
 sentence_output (Dense)     (None, 5)                 505       
                                                                 
=================================================================
Total params: 224,133
Trainable params: 224,133
Non-trainable params: 0

Everything works, and you can paste this code into Colab to see the model summary. My problem is positional encoding at the sentence level: how do I apply it there?

rudolfovic · Answer 1 · 2021-12-08T08:30:10.640

The implementation is recursive in the sense that you treat the average of the outputs of transformer x as the input to transformer x+1.

So let's say your data is structured as (batch, chapter, paragraph, sentence, token).

After the first transformation you end up with (batch, chapter, paragraph, sentence, token, embedding), so then you average over the token axis and get (batch, chapter, paragraph, sentence_embedding_in), i.e. one vector per sentence.

Apply another transformation and get (batch, chapter, paragraph, sentence_embedding_out).

Average again and get (batch, chapter, paragraph_embedding). Rinse & Repeat.
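
The shape bookkeeping above can be sketched in NumPy. This is only an illustration under stated assumptions: `transform` is a hypothetical stand-in (a random linear map on the last axis), not a real transformer block, and all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)


def transform(x, out_dim):
    """Hypothetical stand-in for a transformer block: a random linear map
    on the last axis. Unlike a real transformer it does not mix positions."""
    w = rng.standard_normal((x.shape[-1], out_dim))
    return x @ w


# Token embeddings: (batch, chapter, paragraph, sentence, token, embedding)
tokens = rng.standard_normal((2, 3, 4, 5, 6, 8))

# Level 1: transform the tokens, then average over the token axis
sentence_in = transform(tokens, 8).mean(axis=-2)
# Level 2: transform the sentence vectors, then average over the sentence axis
sentence_out = transform(sentence_in, 8)
paragraph = sentence_out.mean(axis=-2)

print(sentence_in.shape)   # (2, 3, 4, 5, 8)
print(sentence_out.shape)  # (2, 3, 4, 5, 8)
print(paragraph.shape)     # (2, 3, 4, 8)
```

Each level removes one axis by pooling, so the same two steps can be repeated until a single vector per document remains.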

The implementation of the paper is actually in a different repository: https://github.com/ematvey/hierarchical-attention-networks

They actually do something different from what I've described: they apply transformers at the bottom and an RNN at the top. In theory you could do the opposite, or apply an RNN at each layer (that would be really slow). As far as the implementation is concerned you can abstract from that, because the principle remains the same: you apply a transformation, average the outputs and feed the result into the next higher-level "layer" (or "module" in torch lingo).
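
As a rough illustration of putting an RNN at the top level, here is a minimal Keras sketch. It assumes the sentence vectors, shaped (batch, max_sentences, 100), already come from some word-level encoder; the GRU width of 64 and the name `top` are illustrative choices, not taken from the linked repository.

```python
import tensorflow as tf
from tensorflow.keras import layers

max_sentences = 15   # sentences per document (assumed fixed)
sentence_dim = 100   # size of each sentence vector from the lower level
class_number = 5

# One vector per sentence, produced by whatever word-level encoder is used
sentence_vectors = layers.Input(shape=(max_sentences, sentence_dim))
# The GRU reads the sentences in order and returns its final state
doc_vector = layers.GRU(64)(sentence_vectors)
preds = layers.Dense(class_number, activation="softmax")(doc_vector)

top = tf.keras.Model(sentence_vectors, preds)
out = top(tf.zeros((2, max_sentences, sentence_dim)))
print(out.shape)  # (2, 5)
```

Because a recurrent layer processes the sentences sequentially, it is order-aware by construction, so this variant does not need a separate positional encoding at the sentence level.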

edited Dec 08 '21 at 08:30

answered Dec 08 '21 at 08:23

rudolfovic

Thank you very much for your timely response. I edited the post and added my implementation of this model. Could you please look at the code and tell me whether it is implemented correctly? My problem is positional encoding at the sentence level. Given the implemented model, can you tell me how to do positional encoding at the sentence level? – Rahman Dec 08 '21 at 09:12
It should be done exactly the same way as with words (you just treat each sentence as if it were a word) - that's if sentence order matters at all. In some cases it doesn't and so you just don't add anything at all – rudolfovic Dec 08 '21 at 10:25
As you can see in the code, TokenAndPositionEmbedding takes the vocab size as one of its inputs. But at the sentence level, I don't have a vocab size, so I don't know how to apply sentence-level positional encoding. Could you look at my model and help me complete it? – Rahman Dec 08 '21 at 11:08
Maybe create a dummy token (eg 0) for every sentence so that your TokenAndPositionEmbedding would only include the positional component. Then add the resulting embeddings to your actual sentence embeddings. – rudolfovic Dec 08 '21 at 14:22
Can u please show me in the code? My code is executable in colab without any error. Thank u. – Rahman Dec 08 '21 at 15:50
don't have any suggestions? – Rahman Dec 09 '21 at 06:56
I'm sorry, that requires concentration capacity that I simply don't have right now. If you try out my suggestions and still run into issues I might be able to have a look and maybe point out where I think the problem lies – rudolfovic Dec 09 '21 at 09:25
Thank you. I really tried and searched a lot about this problem but didn't find any solution. I would be grateful if you could look at the code to help me to solve this problem. Thank u, Thank you – Rahman Dec 09 '21 at 12:38
I am waiting for your response – Rahman Dec 16 '21 at 13:28
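
The suggestion from the comments (embed only the position of each sentence and add it to the sentence vector) could be sketched as follows. `SentencePositionEmbedding` is a hypothetical helper, not part of Keras or of the question's code. It assumes a fixed number of sentences per document and learned (not sinusoidal) positions.

```python
import tensorflow as tf
from tensorflow.keras import layers


class SentencePositionEmbedding(layers.Layer):
    """Hypothetical helper: adds a learned position vector to each sentence vector.

    Expects inputs of shape (batch, max_sentences, embed_dim). No vocabulary
    is needed because only positions are embedded, not tokens.
    """

    def __init__(self, max_sentences, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.max_sentences = max_sentences
        self.pos_emb = layers.Embedding(input_dim=max_sentences, output_dim=embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=self.max_sentences, delta=1)
        # pos_emb(positions) has shape (max_sentences, embed_dim) and is
        # broadcast over the batch axis when added to x
        return x + self.pos_emb(positions)

    def compute_output_shape(self, input_shape):
        # it does not change the shape of its input
        return input_shape


layer = SentencePositionEmbedding(max_sentences=15, embed_dim=100)
out = layer(tf.zeros((2, 15, 100)))
print(out.shape)  # (2, 15, 100)
```

In the question's model, such a layer would go between `sentence_encoder` and `sentence_transformer`, for example `SentencePositionEmbedding(max_sentences, L1_dense_units)(sentence_encoder)`. Whether it helps depends on whether sentence order matters for the task.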
