
There are a few questions on SO dealing with using groupby with sparse matrices. However, the outputs seem to be lists, dictionaries, dataframes and other objects.

I'm working on an NLP problem and would like to keep all the data in sparse scipy matrices during processing to prevent memory errors.

Here's the context:

I have vectorized some documents (sample data here):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('groupbysparsematrix.csv')
docs = df['Text'].tolist()

vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(docs)

print("Dimensions of training set: {0}".format(train_X.shape))
print(type(train_X))

Dimensions of training set: (8, 180)
<class 'scipy.sparse.csr.csr_matrix'>

From the original dataframe I use the date, in day-of-year format, to create the groups I would like to sum over:

from scipy import sparse

df['Date'] = pd.to_datetime(df['Date'])
groups = df['Date'].apply(lambda x: x.strftime('%j'))
groups_X = sparse.csr_matrix(groups.astype(float)).T
train_X_all = sparse.hstack((train_X, groups_X))

print("Dimensions of concatenated set: {0}".format(train_X_all.shape))

Dimensions of concatenated set: (8, 181)

Now I'd like to apply groupby (or a similar function) to find the sum of each token per day. I would like the output to be another sparse scipy matrix.

The output matrix would be 3 x 181 and look something like this:

 1, 1, 1, ..., 2, 1, 3
 2, 1, 3, ..., 1, 1, 4
 0, 0, 0, ..., 1, 2, 5

where columns 1 to 180 represent the tokens and column 181 represents the day of the year.
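
To make that concrete, the dense/pandas equivalent of what I'm after (not something I can run on my real data, but hopefully it clarifies the goal) would be roughly:

import pandas as pd

# Dense sketch of the desired result: one row per day of year,
# one column per token, summing token counts within each day.
dense = pd.DataFrame(train_X.toarray())
dense['day'] = groups.values            # the '%j' strings computed above
desired = dense.groupby('day').sum()    # 3 rows x 180 token columns (the day is the index)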

– Andrew Brown
  • Are you talking about the `pandas` `groupby`? Can you give a working example with dense arrays? There is a pandas sparse format, but its interaction with sparse matrices is still under development. – hpaulj Sep 23 '16 at 00:47
  • Column 181 - is that sparse or not? – hpaulj Sep 23 '16 at 01:17
  • column 181 (groups_X) is a `scipy.sparse.csc.csc_matrix` but in reality it is dense given that each observation has a date. – Andrew Brown Sep 23 '16 at 01:51

2 Answers


The best way of calculating the sum of selected columns (or rows) of a csr sparse matrix is a matrix product with another sparse matrix that has 1's where you want to sum. In fact csr sum (for a whole row or column) works by matrix product, and indexing rows (or columns) is also done with a product (https://stackoverflow.com/a/39500986/901925).
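
To illustrate the idea with a throwaway example (a minimal sketch, independent of your data): summing all rows is the same as left-multiplying by a 1-row matrix of ones, and summing a chosen subset of rows is left-multiplying by a 0/1 indicator row.

import numpy as np
from scipy import sparse

M1 = sparse.random(5, 4, density=0.5, format='csr')

# sum of all rows == product with a single row of ones
ones_row = sparse.csr_matrix(np.ones((1, 5)))
print(np.allclose((ones_row * M1).toarray(), M1.sum(axis=0)))            # True

# sum of rows 0 and 3 == product with a 0/1 indicator row
ind_row = sparse.csr_matrix(np.array([[1., 0., 0., 1., 0.]]))
print(np.allclose((ind_row * M1).toarray(), M1[[0, 3], :].sum(axis=0)))  # True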

So I'd group the dates array, and use that information to construct the summing 'mask'.

For the sake of discussion, consider this dense array:

In [117]: A
Out[117]: 
array([[0, 2, 7, 5, 0, 7, 0, 8, 0, 7],
       [0, 0, 3, 0, 0, 1, 2, 6, 0, 0],
       [0, 0, 0, 0, 2, 0, 5, 0, 0, 0],
       [4, 0, 6, 0, 0, 5, 0, 0, 1, 4],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 7, 0, 8, 1, 0, 9, 0, 2, 4],
       [9, 0, 8, 4, 0, 0, 0, 0, 9, 7],
       [0, 0, 0, 1, 2, 0, 2, 0, 4, 7],
       [3, 0, 1, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 1, 8, 5, 0, 0, 0, 8, 0]])

Make a sparse copy:

In [118]: M=sparse.csr_matrix(A)

Generate some groups based on the last column; collections.defaultdict is a convenient tool for this:

In [119]: from collections import defaultdict
     ...: grps=defaultdict(list)
In [120]: for i,v in enumerate(A[:,-1]):
     ...:     grps[v].append(i)

In [121]: grps
Out[121]: defaultdict(list, {0: [1, 2, 4, 9], 2: [8], 4: [3, 5], 7: [0, 6, 7]})

I can iterate on those groups, collect rows of M, sum across those rows and produce:

In [122]: {k:M[v,:].sum(axis=0) for k, v in grps.items()}
Out[122]: 
{0: matrix([[0, 0, 4, 8, 7, 2, 7, 6, 8, 0]], dtype=int32),
 2: matrix([[3, 0, 1, 0, 0, 0, 0, 0, 0, 2]], dtype=int32),
 4: matrix([[4, 7, 6, 8, 1, 5, 9, 0, 3, 8]], dtype=int32),
 7: matrix([[ 9,  2, 15, 10,  2,  7,  2,  8, 13, 21]], dtype=int32)}

In the last column, the summed values include 2*4 = 8 (two rows ending in 4) and 3*7 = 21 (three rows ending in 7).

So there are two tasks: collecting the groups, whether with this defaultdict, with itertools.groupby (which in this case would require sorting), or with pandas groupby; and then collecting those rows and summing them. The dictionary iteration is conceptually simple.
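
For the first task, a pandas sketch that produces the same row-index groups (GroupBy.indices maps each group key to the positions of its rows):

import numpy as np
import pandas as pd

# group row positions by the last-column value, e.g.
# {0: array([1, 2, 4, 9]), 2: array([8]), 4: array([3, 5]), 7: array([0, 6, 7])}
grps_pd = pd.Series(np.arange(A.shape[0])).groupby(A[:, -1]).indices

# the same summing step works on these index arrays
sums = {k: M[v, :].sum(axis=0) for k, v in grps_pd.items()}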

A masking matrix might work like this:

In [141]: mask=np.zeros((10,10),int)
In [142]: for i,v in enumerate(A[:,-1]): # same sort of iteration
     ...:     mask[v,i]=1
     ...:     
In [143]: Mask=sparse.csr_matrix(mask)
...
In [145]: Mask.A
Out[145]: 
array([[0, 1, 1, 0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       ....
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)
In [146]: (Mask*M).A
Out[146]: 
array([[ 0,  0,  4,  8,  7,  2,  7,  6,  8,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 3,  0,  1,  0,  0,  0,  0,  0,  0,  2],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 4,  7,  6,  8,  1,  5,  9,  0,  3,  8],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 9,  2, 15, 10,  2,  7,  2,  8, 13, 21],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0]], dtype=int32)

This Mask*M has the same values as the dictionary approach, just with extra all-zero rows. I can isolate the nonzero values with the lil format:

In [147]: (Mask*M).tolil().data
Out[147]: 
array([[4, 8, 7, 2, 7, 6, 8], [], [3, 1, 2], [],
       [4, 7, 6, 8, 1, 5, 9, 3, 8], [], [],
       [9, 2, 15, 10, 2, 7, 2, 8, 13, 21], [], []], dtype=object)

I can construct the Mask matrix directly using the coo sparse style of input:

# data of 1s; row index = group value (last column), column index = original row number
Mask = sparse.csr_matrix((np.ones(A.shape[0], int),
    (A[:,-1], np.arange(A.shape[0]))), shape=A.shape)

That should be faster and avoid the memory error (no loop or large dense array).
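
Applied to the question's variables this might look like the following (a sketch; it assumes the dense groups Series from before the csr conversion and hstack is still available, and uses np.unique to map the day-of-year labels to consecutive row numbers):

import numpy as np
from scipy import sparse

labels = groups.values                                   # one '%j' string per document
days, row_idx = np.unique(labels, return_inverse=True)   # days[row_idx] == labels

n_docs = train_X.shape[0]
Mask = sparse.csr_matrix((np.ones(n_docs, int),
                          (row_idx, np.arange(n_docs))),
                         shape=(len(days), n_docs))

grouped = Mask * train_X   # len(days) x 180; token sums per day, row order given by days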

– hpaulj
  • Great solutions. These work well on small data sets but are there alterations that would allow them to be applied to larger datasets? My particular problem is a (69424, 685296) sparse matrix. jupyter hangs when I attempt `{k:M[v,:].sum(axis=0) for k, v in grps.items()}` and I get a MemoryError when I attempt to make a (69424, 685296) mask of zeros. – Andrew Brown Sep 23 '16 at 10:56
  • I added a method of constructing the group Mask matrix directly with the `coo` style of inputs. – hpaulj Sep 23 '16 at 15:40
  • I have tried your edits with the sample A data and my own data but I receive a `ValueError: setting an array element with a sequence.` I have searched SO and tried to solve the problem by changing the data type and rearranging the syntax but can't get it to work. Any thoughts? – Andrew Brown Sep 23 '16 at 22:46
  • I'd have to see the error stack. Normally that error is produced by a statement like `x[i] = np.array([1,2,3])` where `x` is 1d. The LHS expects a scalar value, but the RHS is giving an array or list (i.e. a sequence). So something has more dimensions than expected. – hpaulj Sep 23 '16 at 23:45
  • Thanks for your help. The problem with the `csr_matrix((data, (row, col)), shape=())` solution is that it uses `A` which is a np array. If you try it with `M` - the csr matrix - you get the `ValueError: setting an array element with a sequence`. My thinking is any solution that employs dense matrices will fail on large datasets. – Andrew Brown Sep 24 '16 at 03:22
  • Can you use the date column in its dense form? `M[:,-1].A.ravel()` or `groups` before the `csr` conversion and `hstack`? We could also work from the sparse form, but you said this was dense anyway, a date for each row. – hpaulj Sep 24 '16 at 03:30
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/124087/discussion-between-andrew-brown-and-hpaulj). – Andrew Brown Sep 24 '16 at 04:19

Here is a trick using LabelBinarizer and matrix multiplication.

from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer(sparse_output=True)
grouped = lb.fit_transform(groups).T.dot(train_X)

grouped is the output sparse matrix of size 3 x 180, and you can find the list of its groups in lb.classes_.
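
If you also want the day of year itself as a 181st column, as in the question, one way (a sketch, mirroring the hstack step from the question) is to append lb.classes_, which holds one day label per output row in the same order as the rows of grouped:

from scipy import sparse

day_col = sparse.csr_matrix(lb.classes_.astype(float)).T   # shape (3, 1)
grouped_all = sparse.hstack((grouped, day_col))            # shape (3, 181)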

– Sergey Zakharov