Upon closer inspection, sparse gradients on Embeddings are optional and can be turned on or off with the sparse parameter:
class torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2, scale_grad_by_freq=False, sparse=False)
Where:
sparse (boolean, optional) – if True, gradient w.r.t. weight matrix
will be a sparse tensor. See Notes for more details regarding sparse
gradients.
And the "Notes" mentioned are what I quoted in the question about a limited number of optimizers being supported for sparse gradients.
Update:
It is theoretically possible but technically difficult to implement some optimization methods on sparse gradients. There is an open issue in the PyTorch repo to add support for all optimizers.
Regarding the original question, I believe Embeddings can be treated as sparse because the layer can operate on the input indices directly, rather than converting them to one-hot encodings and feeding those into a dense layer. This is explained in @Maxim's answer to my related question.
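As a small illustration (the numbers here are made up), the index lookup gives the same result as multiplying a one-hot encoding by the weight matrix, which is why the layer never needs to materialize the one-hot vectors:

import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(num_embeddings=5, embedding_dim=3)
idx = torch.tensor([2])

direct = emb(idx)                                  # lookup by index
one_hot = F.one_hot(idx, num_classes=5).float()
via_matmul = one_hot @ emb.weight                  # one-hot times weight matrix

print(torch.allclose(direct, via_matmul))          # True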