
I have a PyTorch sparse tensor that I need to slice row/column wise using the slice [idx][:,idx], where idx is a list of indices. Using this slice yields my desired result on an ordinary float tensor. Is it possible to apply the same slicing to a sparse tensor? Example here:

import numpy as np
import torch
from torch import autograd

#constructing sparse matrix
i = np.array([[0,1,2,2],[0,1,2,1]])
v = np.ones(4)
i = torch.from_numpy(i.astype("int64"))
v = torch.from_numpy(v.astype("float32"))
test1 = torch.sparse.FloatTensor(i, v)

#constructing float tensor
test2 = np.array([[1,0,0],[0,1,0],[0,1,1]])
test2 = autograd.Variable(torch.cuda.FloatTensor(test2), requires_grad=False)

#slicing
idx = [1,2]
print(test2[idx][:,idx])

output:

Variable containing:
 1  0
 1  1
[torch.cuda.FloatTensor of size 2x2 (GPU 0)]

I am holding a 250,000 x 250,000 adjacency matrix, from which I need to slice n rows and n columns by sampling n random indices (idx). Since the dataset is so large, it is not realistic to convert it to a more convenient datatype.

Can I achieve the same slicing result on test1? Is it even possible? If not, are there any workarounds?

Right now I am running my model with the following "hack" of a solution:

import random

idx = sorted(random.sample(range(0, np.shape(test1)[0]), 9000))
test1 = test1AsCsr[idx][:,idx].todense().astype("int32")
test1 = autograd.Variable(torch.cuda.FloatTensor(test1), requires_grad=False)

Where test1AsCsr is my test1 converted to a scipy CSR matrix. This solution works; however, it is very slow and makes my GPU utilization very low, since it constantly needs to read/write from CPU memory.
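
For context, a minimal sketch (not from the original post) of how test1AsCsr might be built from the sparse tensor, assuming scipy is available and the tensor lives on the CPU:

import scipy.sparse as sp

# hypothetical conversion of the torch sparse COO tensor to a scipy CSR matrix
coo_idx = test1._indices().numpy()
coo_val = test1._values().numpy()
test1AsCsr = sp.csr_matrix((coo_val, (coo_idx[0], coo_idx[1])), shape=tuple(test1.size()))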

Edit: It's fine to have a non-sparse tensor as the result.

NicolaiF
  • I am not sure there is an easy way to do so for now (except by manually searching in `i` for the values in `idx`). There's an [issue](https://github.com/pytorch/pytorch/issues/7416) covering this problem. – benjaminplanche Jun 04 '18 at 10:34
  • I need this problem solved badly, so any solution, even if it isn't an easy one, would be more than welcome. How would you go about constructing the newly sliced matrix from the i vector? – NicolaiF Jun 04 '18 at 10:55
  • By "not easy", I meant "not optimal"... I'm not sure how the suggested workaround may work on larger tensors (if it even works the way you want). – benjaminplanche Jun 04 '18 at 12:52
  • I have attempted a solution using test1._indices() and np.where(), then constructing a new tensor from the results; this is however too slow, since the np.where() search is linear and does not operate on the GPU. – NicolaiF Jun 05 '18 at 18:37
  • `scipy.sparse.csr` uses matrix multiplication to select multiple rows or columns. It constructs an `extractor` sparse matrix. I demonstrate it here: https://stackoverflow.com/questions/39500649/sparse-matrix-slicing-using-list-of-int – hpaulj Jun 06 '18 at 02:07
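
For illustration of the extractor-matrix idea from the last comment, a minimal scipy sketch (not part of the original thread), assuming A is a CSR matrix and idx a list of row/column indices:

import numpy as np
import scipy.sparse as sp

def slice_with_extractor(A, idx):
    # build a sparse "extractor" P with a single 1 per row at the selected columns;
    # P.dot(A) picks the rows, and multiplying by P.T on the right picks the columns
    P = sp.csr_matrix((np.ones(len(idx)), (np.arange(len(idx)), idx)),
                      shape=(len(idx), A.shape[0]))
    return P.dot(A).dot(P.T)  # result is len(idx) x len(idx), still sparse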

2 Answers


Well it's been a couple of years since there was activity on this question, but better late than never.

This is the function I use for slicing sparse tensors. (Helper functions are below)

def slice_torch_sparse_coo_tensor(t, slices):
    """
    params:
    -------
    t: tensor to slice
    slices: slice for each dimension

    returns:
    --------
    t[slices[0], slices[1], ..., slices[n]]
    """

    t = t.coalesce()
    assert len(slices) == len(t.size())
    for i in range(len(slices)):
        if type(slices[i]) is not torch.Tensor:
            slices[i] = torch.tensor(slices[i], dtype=torch.long)

    indices = t.indices()
    values = t.values()
    for dim, slice in enumerate(slices):
        # if the slice covers most of this dimension, matching against its
        # complement and inverting the mask afterwards is cheaper
        invert = False
        if t.size(dim) * 0.6 < len(slice):
            invert = True
            all_nodes = torch.arange(t.size(dim))
            unique, counts = torch.cat([all_nodes, slice]).unique(return_counts=True)
            slice = unique[counts == 1]
        # use the chunked wrapper for larger slices to limit memory usage
        if slice.size(0) > 400:
            mask = ainb_wrapper(indices[dim], slice)
        else:
            mask = ainb(indices[dim], slice)
        if invert:
            mask = ~mask
        indices = indices[:, mask]
        values = values[mask]

    return torch.sparse_coo_tensor(indices, values, t.size()).coalesce()

Usage (took 2.4s on my machine):

indices = torch.randint(low= 0, high= 200000, size= (2, 1000000))
values = torch.rand(size=(1000000,))
t = torch.sparse_coo_tensor(indices, values, size=(200000, 200000))
idx = torch.arange(1000)
slice_torch_sparse_coo_tensor(t, [idx, idx])

out:

tensor(indices=tensor([[ 13,  62,  66,  78, 134, 226, 233, 266, 299, 344, 349,
                        349, 369, 396, 421, 531, 614, 619, 658, 687, 769, 792,
                        810, 840, 926, 979],
                       [255, 479, 305, 687, 672, 867, 444, 559, 772,  96, 788,
                        980, 423, 699, 911, 156, 267, 721, 381, 781,  97, 271,
                        840, 292, 487, 185]]),
       values=tensor([0.4260, 0.4816, 0.8001, 0.8815, 0.3971, 0.4914, 0.7068,
                      0.2329, 0.4038, 0.1757, 0.7758, 0.3210, 0.2593, 0.8290,
                      0.1320, 0.4322, 0.7529, 0.8341, 0.8128, 0.4457, 0.4100,
                      0.1618, 0.4097, 0.3088, 0.6942, 0.5620]),
       size=(200000, 200000), nnz=26, layout=torch.sparse_coo)

Timings for slice_torch_sparse_coo_tensor:

%timeit slice_torch_sparse_coo_tensor(t, [torch.randperm(200000)[:500], torch.arange(200000)])

output:
    1.08 s ± 447 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For comparison, the timing for the built-in torch.index_select:

%timeit t.index_select(0, torch.arange(100))

output:
    56.7 s ± 4.87 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
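
For reference (not benchmarked in the original post), the same two-dimensional slice could in principle be written with the built-in by chaining it over both dimensions, at the cost of the runtime shown above:

# chained index_select over dim 0 and dim 1 mirrors t[idx][:, idx]
idx = torch.arange(1000)
sliced = t.index_select(0, idx).index_select(1, idx)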

These are the helper functions I use for this purpose; the function "ainb" finds the elements of a that are in b. I found it on the internet a while ago but I can't find the post to link it.

import torch
def ainb(a,b):
    """gets mask for elements of a in b"""

    size = (b.size(0), a.size(0))

    if size[0] == 0: # Prevents error in torch.Tensor.max(dim=0)
        return torch.tensor([False]*a.size(0), dtype= torch.bool)
        
    a = a.expand((size[0], size[1]))
    b = b.expand((size[1], size[0])).T

    mask = a.eq(b).max(dim= 0).values

    return mask

import math

def ainb_wrapper(a, b, splits = .72):
    # split `a` into chunks of roughly len(a)**splits elements to keep
    # ainb's quadratic memory use in check, then concatenate the partial masks
    chunk = int(len(a)**splits)
    n_chunks = math.ceil(len(a) / chunk)

    tmp = [ainb(a[i*chunk:(i+1)*chunk], b) for i in range(n_chunks)]

    return torch.cat(tmp)

Since the function scales quadratically with the number of elements, I added a wrapper that splits the input into chunks and then concatenates the output. It's more efficient when using only the CPU, but I am not sure whether this holds on a GPU; I would appreciate it if someone could test it :)
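
A quick way to sanity-check the helpers on a GPU might look like this (an untested sketch, assuming a CUDA device is available):

if torch.cuda.is_available():
    a = torch.randint(0, 200000, (100000,), device="cuda")
    b = torch.arange(1000, device="cuda")
    # the chunked wrapper should agree with the plain quadratic version
    assert torch.equal(ainb_wrapper(a, b), ainb(a, b))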

It's my first time posting, so feedback on the quality of the post is also appreciated.

Prezt

Possible answer for 2-dimensional sparse indices

Find below an answer playing with several pytorch methods (torch.eq(), torch.unique(), torch.sort(), etc.) in order to output a compact, sliced tensor of shape (len(idx), len(idx)).

I tested several edge cases (unordered idx, v with 0s, i with multiple identical index pairs, etc.), though I may have forgotten some. Performance should also be checked.

import torch
import numpy as np

def in1D(x, labels):
    """
    Sub-optimal equivalent to numpy.in1d().
    Hopefully this feature will be properly covered soon
    c.f. https://github.com/pytorch/pytorch/issues/3025
    Snippet by Aron Barreira Bordin
    Args:
        x (Tensor):             Tensor to search values in
        labels (Tensor/list):   1D array of values to search for

    Returns:
        Tensor: Boolean tensor y of same shape as x, with y[ind] = True if x[ind] in labels

    Example:
        >>> in1D(torch.FloatTensor([1, 2, 0, 3]), [2, 3])
        ByteTensor([0, 1, 0, 1])
    """
    mapping = torch.zeros(x.size()).byte()
    for label in labels:
        mapping = mapping | x.eq(label)
    return mapping


def compact1D(x):
    """
    "Compact" values 1D uint tensor, so that all values are in [0, max(unique(x))].
    Args:
        x (Tensor): uint Tensor

    Returns:
        Tensor: uint Tensor of same shape as x

    Example:
        >>> compact1D(torch.ByteTensor([5, 8, 7, 3, 8, 42]))
        ByteTensor([1, 3, 2, 0, 3, 4])
    """
    x_sorted, x_sorted_ind = torch.sort(x, descending=True)
    x_sorted_unique, x_sorted_unique_ind = torch.unique(x_sorted, return_inverse=True)
    x[x_sorted_ind] = x_sorted_unique_ind
    return x

# Input sparse tensor:
i = torch.from_numpy(np.array([[0,1,4,3,2,1],[0,1,3,1,4,1]]).astype("int64"))
v = torch.from_numpy(np.arange(1, 7).astype("float32"))
test1 = torch.sparse.FloatTensor(i, v)
print(test1.to_dense())
# tensor([[ 1.,  0.,  0.,  0.,  0.],
#         [ 0.,  8.,  0.,  0.,  0.],
#         [ 0.,  0.,  0.,  0.,  5.],
#         [ 0.,  4.,  0.,  0.,  0.],
#         [ 0.,  0.,  0.,  3.,  0.]])

# note: test1[1, 1] = v[1] + v[5] = 2 + 6 = 8
#       since both i[:, 1] and i[:, 5] are [1, 1]

# Input slicing indices:
idx = [4,1,3]

# Getting the elements in `i` which correspond to `idx`:
v_idx = in1D(i, idx).byte()
v_idx = v_idx.sum(dim=0).squeeze() == i.size(0) # or `v_idx.all(dim=0)` for pytorch 0.5+
v_idx = v_idx.nonzero().squeeze()

# Slicing `v` and `i` accordingly:
v_sliced = v[v_idx]
i_sliced = i.index_select(dim=1, index=v_idx)

# Building sparse result tensor:
i_sliced[0] = compact1D(i_sliced[0])
i_sliced[1] = compact1D(i_sliced[1])

# To make sure to have a square dense representation:
size_sliced = torch.Size([len(idx), len(idx)])
res = torch.sparse.FloatTensor(i_sliced, v_sliced, size_sliced)

print(res)
# torch.sparse.FloatTensor of size (3,3) with indices:
# tensor([[ 0,  2,  1,  0],
#         [ 0,  1,  0,  0]])
# and values:
# tensor([ 2.,  3.,  4.,  6.])

print(res.to_dense())
# tensor([[ 8.,  0.,  0.],
#         [ 4.,  0.,  0.],
#         [ 0.,  3.,  0.]])

Previous answer for 1-dimensional sparse indices

Here is a (probably sub-optimal and not covering all edge cases) solution, following the intuitions shared in a related open issue (hopefully this feature will be properly covered soon):

# Constructing a sparse tensor a bit more complicated for the sake of demo:
i = torch.LongTensor([[0, 1, 5, 2]])
v = torch.FloatTensor([[1, 3, 0], [5, 7, 0], [9, 9, 9], [1,2,3]])
test1 = torch.sparse.FloatTensor(i, v)

# note: if you directly have sparse `test1`, you can get `i` and `v`:
# i, v = test1._indices(), test1._values()

# Getting the slicing indices:
idx = [1,2]

# Preparing to slice `v` according to `idx`.
# For that, we gather the list of indices `v_idx` such that i[v_idx[k]] == idx[k]:
i_squeeze = i.squeeze()
v_idx = [(i_squeeze == j).nonzero() for j in idx] # <- doesn't seem optimal...
v_idx = torch.cat(v_idx, dim=1)

# Slicing `v` accordingly:
v_sliced = v[v_idx.squeeze()][:,idx]

# Now defining your resulting sparse tensor.
# I'm not sure what kind of indexing you want, so here are 2 possibilities:
# 1) "Dense" indixing:
test1x = torch.sparse.FloatTensor(torch.arange(v_idx.size(1)).long().unsqueeze(0), v_sliced)
print(test1x)
# torch.sparse.FloatTensor of size (2,2) with indices:
#
#  0  1
# [torch.LongTensor of size (1,2)]
# and values:
#
#  7  0
#  2  3
# [torch.FloatTensor of size (2,2)]

# 2) "Sparse" indixing using the original `idx`:
test1x = torch.sparse.FloatTensor(autograd.Variable(torch.LongTensor(idx)).unsqueeze(0), v_sliced)
# note: this indexing would fail if elements of `idx` were not in `i`.
print(test1x)
# torch.sparse.FloatTensor of size (3,2) with indices:
#
#  1  2
# [torch.LongTensor of size (1,2)]
# and values:
#
#  7  0
#  2  3
# [torch.FloatTensor of size (2,2)]
benjaminplanche
  • This might do what I want, but I am having weird issues with pytorch not letting me write to variables, making it difficult to test: https://discuss.pytorch.org/t/something-goes-horribly-wrong-when-i-try-to-create-variables/19182 – NicolaiF Jun 04 '18 at 15:56
  • What is the i.squeeze() for? – NicolaiF Jun 05 '18 at 18:19
  • Updated the question; the indices I am given are 2-dimensional – NicolaiF Jun 05 '18 at 18:33
  • Find a new suggested solution. Once again, it's rather sub-optimal and probably could be improved, but maybe it will suit your urgent needs... – benjaminplanche Jun 06 '18 at 11:03
  • This removes the columns/rows if they are entirely 0, which messes with the dimensions; I need the output to be len(idx) x len(idx) – NicolaiF Jun 06 '18 at 11:54
  • Right, the "sparse to dense" conversion of the results was erroneous. Not sure how to cover that... – benjaminplanche Jun 06 '18 at 13:07
  • Well, I gave it some more thought (it's a really interesting problem!) and came up with a possible solution. Let me know what you think. :) – benjaminplanche Jun 06 '18 at 23:26
  • I think it works, it's just too slow - it might not be possible with reasonable speeds. I tried a dictionary-based approach myself, to no avail :( – NicolaiF Jun 07 '18 at 19:32