7

I have two matrices of interest, the first is a "bag of words" matrix, with two columns: the document ID and the term ID. For example:

bow[0:10]

Out[1]:
    array([[ 0, 10],
           [ 0, 12],
           [ 0, 19],
           [ 0, 20],
           [ 1,  9],
           [ 1, 24],
           [ 2, 33],
           [ 2, 34],
           [ 2, 35],
           [ 3, 2]])

In addition, I have an "index" matrix, where every row in the matrix contains the index of the first and last row for a given document ID in the bag of words matrix. Ex: row 0 is the first and last index for doc id 0. For example:

index[0:4]

Out[2]:
    array([[ 0,  4],
           [ 4,  6],
           [ 6,  9],
           [ 9, 10]])

What I'd like to do is take a random sample of document ID's and get all of the bag of word rows for those document ID's. The bag of words matrix is roughly 150M rows (~1.5Gb), so using numpy.in1d() is too slow. We need to return these rapidly for feeding into a downstream task.

The naive solution I have come up with is as follows:

def get_rows(ids):
    indices = np.concatenate([np.arange(x1, x2) for x1,x2 in index[ids]])
    return bow[indices]

get_rows([4,10,3,5])

Generic sample

A generic sample to put forth the problem would be with something like this -

indices = np.array([[ 4, 7],
                    [10,16],
                    [11,18]]

The expected output would be -

array([ 4,  5,  6, 10, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15, 16, 17])
Divakar
  • 218,885
  • 19
  • 262
  • 358
Alexander David
  • 769
  • 2
  • 8
  • 19
  • Considering that the output you're trying to produce is jagged, there's not going to be a nice, vectorized solution. – user2357112 Nov 05 '17 at 19:36
  • 2
    Since the endpoints are adjacent to the starts of the next groups, the concatenated o/p would be simply `range(a[0,0],a[-1,-1])`, right? – Divakar Nov 05 '17 at 19:37
  • Producing the `concatenate` output without going through the jagged intermediate array might be possible vectorized, though. – user2357112 Nov 05 '17 at 19:37
  • ...hey, yeah, are the ranges all going to be contiguous like that? Divakar brings up a good point, if they are. – user2357112 Nov 05 '17 at 19:38
  • @Divark - not necessarily, it might be a random list of indices, ex: arr[[5,100, 31, 123]]. You are correct though, they are jagged. My current method: sparse_rows = np.concatenate([np.arange(x1, x2) for x1,x2 in arr[idxs]]) is pretty slow – Alexander David Nov 05 '17 at 19:40
  • Add a more generic, representative sample please. Also, would be it okay to have a boolean array as output with the places as True valued for the indices that are present? – Divakar Nov 05 '17 at 19:42
  • @Divakar - I added more detail around the actual use case to shed some light. Let me know if there is anything I can do. Thank you for your help! – Alexander David Nov 05 '17 at 20:02
  • `index` again seems to have the ends and next starts adjacent in the edited sample :) – Divakar Nov 05 '17 at 20:09
  • emphasis added ("random sample" vs "sample") and an example of get_rows with non-sequential data – Alexander David Nov 05 '17 at 20:21

1 Answers1

9

Think I have cracked it finally with a cumsum trick for a vectorized solution -

def create_ranges(a):
    l = a[:,1] - a[:,0]
    clens = l.cumsum()
    ids = np.ones(clens[-1],dtype=int)
    ids[0] = a[0,0]
    ids[clens[:-1]] = a[1:,0] - a[:-1,1]+1
    out = ids.cumsum()
    return out

Sample runs -

In [416]: a = np.array([[4,7],[10,16],[11,18]])

In [417]: create_ranges(a)
Out[417]: array([ 4,  5,  6, 10, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15, 16, 17])

In [425]: a = np.array([[-2,4],[-5,2],[11,12]])

In [426]: create_ranges(a)
Out[426]: array([-2, -1,  0,  1,  2,  3, -5, -4, -3, -2, -1,  0,  1, 11])

If we are given starts and stops as two 1D arrays, we just need to use those in place of the first and second columns. For completeness, here's the complete code -

def create_ranges(starts, ends):
    l = ends - starts
    clens = l.cumsum()
    ids = np.ones(clens[-1],dtype=int)
    ids[0] = starts[0]
    ids[clens[:-1]] = starts[1:] - ends[:-1]+1
    out = ids.cumsum()
    return out
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • @yatu Yeah the questions are basically the same, hence closed as dup. Poeple who are looking for performance being linked could be encouraged to visit that too. It's better to have one Q&A for related question. You might also want to add that answer post here. – Divakar Mar 24 '20 at 19:05
  • The usual dup closing that I have seen people do that is to keep the oldest one through my few years of experience on SO. Hence, following the same. Hope it's taken in good faith :) – Divakar Mar 24 '20 at 19:13