Creating a binary sparse matrix in Python with pandas, quickly and efficiently

Asked Jan 28 '18 at 16:03

Active Jan 28 '18 at 16:51

Viewed 18 times

I have a pandas DataFrame like this:

    document  term
      X        a
      X        b
      X        a
      X        c
      Y        a
      Y        c
      Y        d

I want to create a sparse matrix like the one below, with the unique documents as rows and the unique terms as columns. A cell should be 1 if the document and term co-occur in the original DataFrame, regardless of how many times, and 0 otherwise.

          a    b    c    d
      X   1    1    1    0
      Y   1    0    1    1

I have tried a for loop, but it is too slow with a million rows.

My answer, following the suggestion from piRSquared:

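    # Drop duplicate (document, term) pairs so every count is at most 1
    df = df.drop_duplicates()
    # Count the remaining pairs; missing combinations are filled with 0
    df.pivot_table(index='document', columns='term', fill_value=0, aggfunc='size')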
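For reference, here is a minimal, self-contained sketch of that approach on the sample data from the question. It also shows the other option mentioned in the comments, clipping a cross tabulation at 1. The variable names `binary_pivot` and `binary_crosstab` are only illustrative, and both results are ordinary dense DataFrames.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "document": ["X", "X", "X", "X", "Y", "Y", "Y"],
    "term": ["a", "b", "a", "c", "a", "c", "d"],
})

# Option 1: drop duplicate (document, term) pairs, then count what is left.
# After deduplication every count is either 1 or missing (filled with 0).
binary_pivot = (
    df.drop_duplicates()
      .pivot_table(index="document", columns="term", aggfunc="size", fill_value=0)
)

# Option 2: count all occurrences with crosstab, then cap the counts at 1.
binary_crosstab = pd.crosstab(df["document"], df["term"]).clip(upper=1)

print(binary_pivot)
print(binary_crosstab)
```

Both options avoid an explicit Python loop; which one is faster on a large DataFrame is worth measuring on the real data.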

edited Jan 28 '18 at 16:51

asked Jan 28 '18 at 16:03

pat

Sparse = far fewer 1s than 0s, by quite a margin. Yours is not. – Patrick Artner Jan 28 '18 at 16:04
It is called cross tabulation. You’ll find your answer here https://stackoverflow.com/a/47152692/2336654 – piRSquared Jan 28 '18 at 16:11
@piRSquared, in cross tabulation the values are filled with the frequency with which each term occurred, but my case is a bit different: I am trying to create a sparse matrix of 0s and 1s. – pat Jan 28 '18 at 16:34
@Nishal you can either drop duplicates in the original data or clip the results at 1. – piRSquared Jan 28 '18 at 16:35
@piRSquared, yes, removing the duplicates works and gives me the required sparse matrix. – pat Jan 28 '18 at 16:47
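As a side note on the sparsity point raised in the comments: the pivoted result is a dense DataFrame. If a memory-efficient sparse structure is needed for a large vocabulary, one possible sketch is to build a SciPy CSR matrix directly from the deduplicated pairs. This is not part of the original discussion; it assumes SciPy is installed, and the names `pairs`, `documents`, `terms`, and `matrix` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Sample data from the question
df = pd.DataFrame({
    "document": ["X", "X", "X", "X", "Y", "Y", "Y"],
    "term": ["a", "b", "a", "c", "a", "c", "d"],
})

# Deduplicate first: csr_matrix sums entries that share a (row, col) position,
# so repeated pairs would otherwise produce values greater than 1.
pairs = df.drop_duplicates()

# Map each document and term to an integer code; sort=True keeps label order stable.
row_codes, documents = pd.factorize(pairs["document"], sort=True)
col_codes, terms = pd.factorize(pairs["term"], sort=True)

# Store a 1 at every (document, term) position that occurs in the data.
matrix = csr_matrix(
    (np.ones(len(pairs), dtype=np.int8), (row_codes, col_codes)),
    shape=(len(documents), len(terms)),
)

print(list(documents), list(terms))
print(matrix.toarray())
```

Only the non-zero entries are stored, and `documents` and `terms` keep the row and column labels that the matrix itself does not carry.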
