There are a few questions on SO dealing with using groupby
with sparse matrices. However, the outputs seem to be lists, dictionaries, DataFrames, and other objects.
I'm working on an NLP problem and would like to keep all the data in sparse scipy matrices during processing to prevent memory errors.
Here's the context:
I have vectorized some documents (sample data here):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv('groupbysparsematrix.csv')
docs = df['Text'].tolist()
vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(docs)
print("Dimensions of training set: {0}".format(train_X.shape))
print(type(train_X))
Dimensions of training set: (8, 180)
<class 'scipy.sparse.csr.csr_matrix'>
From the original DataFrame I use the date, in day-of-year format, to create the groups I would like to sum over:
from scipy import sparse
df['Date'] = pd.to_datetime(df['Date'])
groups = df['Date'].apply(lambda x: x.strftime('%j'))
groups_X = sparse.csr_matrix(groups.astype(float)).T
train_X_all = sparse.hstack((train_X, groups_X))
print("Dimensions of concatenated set: {0}".format(train_X_all.shape))
Dimensions of concatenated set: (8, 181)
Now I'd like to apply groupby (or a similar function) to find the sum of each token per day. I would like the output to be another sparse scipy matrix.
The output matrix would be 3 x 181 and look something like this:
1, 1, 1, ..., 2, 1, 3
2, 1, 3, ..., 1, 1, 4
0, 0, 0, ..., 1, 2, 5
Where the columns 1 to 180 represent the tokens and column 181 represents the day of the year.
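To make the target concrete, here is a self-contained sketch of the kind of group-wise sum I mean, using a sparse group-indicator matrix so everything stays in scipy sparse format throughout. The toy data and the indicator-matrix approach are only my illustration of the desired behavior, not a confirmed solution:

```python
import numpy as np
import scipy.sparse as sp

# Toy document-term matrix: 8 docs x 4 tokens (stand-in for train_X)
rng = np.random.default_rng(0)
X = sp.csr_matrix(rng.integers(0, 3, size=(8, 4)))

# Day-of-year label for each document (stand-in for the groups series)
days = np.array([32, 32, 32, 33, 33, 34, 34, 34])

# Build a sparse indicator matrix G (n_groups x n_docs):
# G[i, j] = 1 if document j belongs to group i
uniq, inv = np.unique(days, return_inverse=True)
G = sp.csr_matrix(
    (np.ones(len(days)), (inv, np.arange(len(days)))),
    shape=(len(uniq), len(days)),
)

# Group-wise token sums stay sparse: one row per day
sums = G @ X
print(sums.shape)  # (3, 4)

# Append the day-of-year as a final column, as in the desired output
day_col = sp.csr_matrix(uniq.astype(float)).T
result = sp.hstack((sums, day_col)).tocsr()
print(result.shape)  # (3, 5)
```

On the full data this would give the 3 x 181 matrix described above, with the day of year in the last column.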