My input is a list of values, data_idx. In the example, the values are in the range [0, 5].
data_idx = [2, 5, 5, 0, 4, 1, 4, 5, 3, 2, 1, 0, 3, 3, 0]
My desired output, filled_matrix, is a tensor of shape max(data_idx) + 1 by len(data_idx), where each row r contains all of the indices at which data_idx == r, padded with -1 for the rest of the row when the number of matched indices is fewer than len(data_idx).
For example, in the first row (r=0), data_idx == 0 at indices [3, 11, 14]. The full output would look like:
filled_matrix = tensor([[ 3, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
[ 5, 10, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
[ 0, 9, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
[ 8, 12, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
[ 4, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
[ 1, 2, 7, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]],
dtype=torch.int8)
I have working for-loop code that accomplishes my goal.
import torch

max_idx = 6  # number of distinct values (0 through 5)
data_idx = torch.tensor([2, 5, 5, 0, 4, 1, 4, 5, 3, 2, 1, 0, 3, 3, 0]).cuda()
max_number_data_idx = data_idx.shape[0]

# Start from a matrix of -1s, then overwrite each row with the matched indices.
filled_matrix = torch.full([max_idx, max_number_data_idx], -1, dtype=torch.int8, device='cuda')
for i in range(max_idx):
    same_idx = (data_idx == i).nonzero().flatten()  # positions where data_idx == i
    filled_matrix[i, :same_idx.shape[0]] = same_idx
Now I want to speed up this code, specifically on the GPU. In the real scenario, the input data_idx can be a list containing millions of values. In that case, with e.g. 1 M distinct values, the loop body launches a GPU kernel 1 M times, which makes it very slow. My code is sequential, and GPUs hate sequential code.
Is there a function which will produce the same result more efficiently? Or a way to vectorize this for-loop?
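For reference, the closest I have gotten to a vectorized version is the sort-based sketch below: group the original positions by value with one stable sort, compute per-value counts with torch.bincount, and scatter everything into the output with a single indexing assignment. It assumes PyTorch 1.9+ (for stable=True) and uses torch.long instead of int8, since positions in a million-element input overflow int8; I have not benchmarked it at that scale.

import torch

max_idx = 6
data_idx = torch.tensor([2, 5, 5, 0, 4, 1, 4, 5, 3, 2, 1, 0, 3, 3, 0]).cuda()
n = data_idx.shape[0]

# One stable sort groups the original positions by value:
# 'order' lists the positions of all 0s, then all 1s, etc.
values, order = torch.sort(data_idx, stable=True)

# Occurrences per value, and where each value's group starts in the
# sorted array (exclusive prefix sum of the counts).
counts = torch.bincount(data_idx, minlength=max_idx)
starts = torch.cumsum(counts, dim=0) - counts

# Column of each element inside its row: its rank within its group.
cols = torch.arange(n, device=data_idx.device) - starts[values]

# torch.long rather than int8, since real positions exceed 127.
filled_matrix = torch.full((max_idx, n), -1, dtype=torch.long, device=data_idx.device)
filled_matrix[values, cols] = order

On the toy input this reproduces filled_matrix above (apart from the dtype), but I don't know whether it is actually the fastest approach at scale.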