
I have a dataframe of 42 million rows and 6 columns that I need to crosstab to get counts of specific events for each person in the dataset. The result will be a very large sparse matrix, roughly 1.5 million rows by 36,000 columns. When I try this with the pandas crosstab function (pd.crosstab) I run out of memory on my system. Is there some way to do this crosstab in chunks and join the resulting dataframes? To be clear, each row of the crosstab counts the number of times an event occurred for each person in the dataset (i.e. each row is a person, each column entry is the count of times that person participated in a specific event). The ultimate goal is to factor the resulting person-event matrix using PCA/SVD.
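A minimal sketch of the chunk-and-combine idea described above, assuming a hypothetical events.csv with hypothetical person_id and event_id columns; each chunk is crosstabbed on its own and the partial counts are summed. Note this still builds dense intermediate frames, so it may only postpone the memory problem:

import pandas as pd

# hypothetical file and column names; read the raw events in 1M-row chunks
total = None
for chunk in pd.read_csv('events.csv', usecols=['person_id', 'event_id'],
                         chunksize=1_000_000):
    part = pd.crosstab(chunk['person_id'], chunk['event_id'])
    # align on person/event labels and sum the partial counts
    total = part if total is None else total.add(part, fill_value=0)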

user13999
  • You can try dask but, given the matrix is going to be sparse, I'd rather do a groupby event, person count and try to arrange data from there. – rpanai Feb 25 '20 at 21:59
    Thanks for suggestion but if I'm understanding correctly I don't think that will work for my application. I need to try to factor this person-event matrix (PCA or SVD) so I need person level data to be preserved. – user13999 Feb 25 '20 at 22:05
    This might help: https://stackoverflow.com/questions/44729727/pandas-slice-large-dataframe-in-chunks – K753 Feb 25 '20 at 22:10
  • Thanks for the helpful link! – user13999 Feb 25 '20 at 22:34

1 Answer


Setup

import numpy as np
import pandas as pd

source_0 = [*'ABCDEFGHIJ']
source_1 = [*'abcdefghij']

np.random.seed([3, 1415])

df = pd.DataFrame({
    'source_0': np.random.choice(source_0, 100),
    'source_1': np.random.choice(source_1, 100),
})

df

   source_0 source_1
0         A        b
1         C        b
2         H        f
3         D        a
4         I        h
..      ...      ...
95        C        f
96        F        a
97        I        j
98        I        d
99        J        b

Use pd.factorize to get an integer factorization... and unique values

# `ij` codes each (source_0, source_1) pair as an integer; `tups` holds the unique pairs
ij, tups = pd.factorize(list(zip(*map(df.get, df))))
# count the occurrences of each code and pair the counts back with the unique tuples
result = dict(zip(tups, np.bincount(ij)))

This is already a compact form. But you can convert it to a pandas.Series and unstack to verify it is what we want.

pd.Series(result).unstack(fill_value=0)

   a  b  c  d  e  f  g  h  i  j
A  2  1  0  0  0  1  0  2  1  1
B  0  1  0  0  0  1  0  1  0  1
C  0  3  1  3  0  2  0  0  0  0
D  3  0  0  2  0  0  1  3  0  2
E  3  0  0  1  0  1  2  5  0  0
F  4  0  2  1  1  1  1  1  1  0
G  0  2  1  0  0  2  3  0  3  1
H  1  3  2  0  2  1  1  1  0  2
I  2  2  1  1  2  0  1  2  0  2
J  0  1  1  0  1  1  0  1  0  1
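As a quick sanity check, on this small example the factorize/bincount route reproduces plain pd.crosstab:

# should print True on this example
print(pd.Series(result).unstack(fill_value=0).equals(
    pd.crosstab(df['source_0'], df['source_1'])))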

Using sparse

from scipy.sparse import csr_matrix

# integer codes for the rows (persons) and columns (events), plus their unique labels
i, r = pd.factorize(df['source_0'])
j, c = pd.factorize(df['source_1'])
# on some pandas versions this raises a ValueError; wrapping the list in
# pd.Series(...) works around it (see the comments below)
ij, tups = pd.factorize(list(zip(i, j)))

# COO-style constructor: data = counts per unique (i, j) pair, indices = the pairs
a = csr_matrix((np.bincount(ij), tuple(zip(*tups))))

# wrap the sparse matrix in a sparse-backed DataFrame with the original labels
b = pd.DataFrame.sparse.from_spmatrix(a, r, c).sort_index().sort_index(axis=1)

b

   a  b  c  d  e  f  g  h  i  j
A  2  1  0  0  0  1  0  2  1  1
B  0  1  0  0  0  1  0  1  0  1
C  0  3  1  3  0  2  0  0  0  0
D  3  0  0  2  0  0  1  3  0  2
E  3  0  0  1  0  1  2  5  0  0
F  4  0  2  1  1  1  1  1  1  0
G  0  2  1  0  0  2  3  0  3  1
H  1  3  2  0  2  1  1  1  0  2
I  2  2  1  1  2  0  1  2  0  2
J  0  1  1  0  1  1  0  1  0  1
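Since the stated end goal is PCA/SVD, the csr_matrix a can be passed straight to scikit-learn's TruncatedSVD, which accepts sparse input, so the person-event matrix is never densified. A minimal sketch (n_components=5 is an illustrative choice for this toy example):

from sklearn.decomposition import TruncatedSVD

# TruncatedSVD operates directly on scipy sparse matrices
svd = TruncatedSVD(n_components=5, random_state=0)
factors = svd.fit_transform(a)  # shape: (n_persons, n_components)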
piRSquared
  • Interesting suggestion. You're using some methods I'm not familiar with. To make sure I understand, in the df you created, column 'source_0' would have the repeated id of the person (repeated each time an event occurred) and 'source_1' would have the event type id? – user13999 Feb 25 '20 at 22:21
  • Yes! However, if you propose some sample data of your own, I'm happy to repeat the process. Also, I'm working on including some sparse solutions as well – piRSquared Feb 25 '20 at 22:22
  • Clever solution, thanks! Is there some way to unstack your two column df into a sparse matrix using scipy.sparse.csc_matrix so as a next step I can factorize with PCA/SVD using something like TruncatedSVD() in sklearn? – user13999 Feb 25 '20 at 22:29
  • Thanks! Looks promising. Will give it a try and accept answer if it works for me. – user13999 Feb 26 '20 at 02:53
  • I am getting an error when I try your scipy sparse matrix solution. Specifically, this line: "ij, tups = pd.factorize(list(zip(i, j)))" is giving me "ValueError: Buffer has wrong number of dimensions (expected 1, got 2)". The two pd.factorize() lines seem to work ok. Any suggestions? – user13999 Feb 26 '20 at 19:21
  • pd.factorize() appears to work correctly if you first convert list(zip(i, j)) to a pandas Series (i.e. pd.factorize(pd.Series(list(zip(i, j)))) works correctly). This could be a pandas version issue (I am using 0.18.1 in the conda environment for this project). – user13999 Feb 26 '20 at 21:56
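For reference, the workaround from the last comment, with the Series wrapper spelled out:

# wrapping the zipped codes in a Series avoids the version-dependent ValueError
ij, tups = pd.factorize(pd.Series(list(zip(i, j))))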