Upon closer inspection, sparse gradients on Embeddings are optional and can be turned on or off with the sparse parameter:
class torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2, scale_grad_by_freq=False, sparse=False)
Where:
sparse (boolean, optional) – if True, gradient w.r.t. weight matrix
will be a sparse tensor. See Notes for more details regarding sparse
gradients.
And the "Notes" mentioned are what I quoted in the question about a limited number of optimizers being supported for sparse gradients.
Update:
It is theoretically possible but technically difficult to implement some optimization methods on sparse gradients. There is an open issue in the PyTorch repo to add support for all optimizers.
Regarding the original question, I believe Embeddings can be treated as sparse because the layer can operate on the input indices directly, rather than converting them to one-hot encodings and feeding those into a dense layer. This is explained in @Maxim's answer to my related question.
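As a small illustration (the numbers here are made up), the index lookup gives the same result as multiplying a one-hot encoding by the weight matrix, which is why the layer never needs to materialize the one-hot vectors:

import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(num_embeddings=5, embedding_dim=3)
idx = torch.tensor([2])

direct = emb(idx)                                  # lookup by index
one_hot = F.one_hot(idx, num_classes=5).float()
via_matmul = one_hot @ emb.weight                  # one-hot times weight matrix

print(torch.allclose(direct, via_matmul))          # True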