
I want to code a GPT-like transformer for a specific text generation task. GPT-like models use only the decoder block (in stacks) [1]. I know how to code all sub-modules of the decoder block shown below (from the embedding to the softmax layer) in Pytorch. However, I don't know what I should give as input. It says (in the figure) "Output shifted right".

[figure: GPT-like decoder block, from the embedding layer through masked multi-head attention up to the softmax layer, with the input labeled "Outputs (shifted right)"]

For example, this is my data (where < and > are the SOS and EOS tokens):

  • < abcdefgh >

What should I give to my GPT-like model to train it properly?

Also, since I am not using an encoder, should I still give input to the multi-head attention block?

Sorry if my questions seem a little dumb, I am so new to transformers.

mac179

3 Answers


The input for a decoder-only model like GPT is typically a sequence of tokens, just like in an encoder-decoder model. However, the difference lies in how the input is processed.

In an encoder-decoder model, the input sequence is first processed by an encoder component that produces contextual representations of the input. The decoder component then attends to these representations (via cross-attention) to generate the output sequence.

In contrast, in a decoder-only model like GPT, there is no separate encoder component. Instead, the input sequence is directly fed into the decoder, which generates the output sequence by attending to the input sequence through self-attention mechanisms.

In both cases, the input sequence is typically a sequence of tokens that represent the text data being processed. The tokens may be words, subwords, or characters, depending on the specific modeling approach and the granularity of the text data being processed.
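To make the "shifted right" part concrete: for the sequence `< abcdefgh >` from the question, the training input and target are simply the same token sequence offset by one position (teacher forcing). A minimal sketch, using the question's tokens rather than real token IDs:

```python
# Teacher forcing for a decoder-only model: the input at step t is
# the token at position t, and the training target is the token at
# position t+1. "Outputs (shifted right)" in the figure refers to
# exactly this one-position offset.
tokens = ["<", "a", "b", "c", "d", "e", "f", "g", "h", ">"]

model_input = tokens[:-1]  # ['<', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
target = tokens[1:]        # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', '>']

# The model sees model_input and, at every position, is trained
# (with cross-entropy) to predict the corresponding entry of target.
print(model_input)
print(target)
```

In practice you would map the tokens to integer IDs and feed whole batches, but the shifting itself is just this slicing.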

Tamir
  • Hi thanks for the response. Should the output (illustrated in the figure) be shifted in this case? AND should I still give input to the multihead attention block of the decoder (based on the figure)? – mac179 Mar 09 '23 at 06:51
  • No need to give any further input to the multi-head attention (it gets the input from the previous block). The output shifting depends on the specific model implementation; usually, a "BOS" (beginning-of-sequence) token is added automatically – Tamir Mar 09 '23 at 10:07

Your question is not dumb, descriptions around the transformer model are sometimes a bit vague.

When GPT-x says it uses a "decoder-only" architecture, this also means that the decoder is missing the encoder-decoder (cross-)attention block -- as there is no encoder; see the image below.

[figure: the decoder block with the encoder-decoder attention sub-layer removed]

If you look closely, this now looks like the encoder, right? So you could also say that it's an "encoder-only" architecture. The important difference is that you still require the causal attention mask (i.e., the "do not look ahead" attention mask) which is the default for the decoder.
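In PyTorch, that causal mask is a simple upper-triangular matrix. A minimal sketch (using `torch.triu`; PyTorch also provides `nn.Transformer.generate_square_subsequent_mask`, which builds an equivalent float mask):

```python
import torch

# Causal ("do not look ahead") attention mask for a length-5 sequence.
# Entry [i, j] is True when position i must NOT attend to position j,
# i.e. whenever j is in the future (j > i).
seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# This boolean form can be passed as attn_mask to nn.MultiheadAttention
# (True = masked out). The diagonal and lower triangle stay False, so
# each position can attend to itself and to earlier positions.
print(mask)
```

This mask is what distinguishes the block from a plain encoder layer: without it, every position could attend to every other position.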

Christian

If I am not mistaken, an encoder-only stack should look something like this:

[figure: a stack of encoder blocks]

Aniket