
I have a pandas DataFrame, df, like this:

    document  term
      X        a
      X        b
      X        a
      X        c
      Y        a
      Y        c
      Y        d

I want to create a sparse matrix whose rows are the unique documents and whose columns are the unique terms. A cell should be 1 if the document and term co-occur in the original DataFrame, regardless of how many times, and 0 otherwise:

          a    b    c    d
      X   1    1    1    0
      Y   1    0    1    1

I have tried a for loop, but it is too slow with a million rows.

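My answer, after the suggestion from piRSquared:

    # Drop duplicate (document, term) pairs so every count is at most 1
    df = df.drop_duplicates()
    # Count the remaining pairs; combinations that never occur are filled with 0
    result = df.pivot_table(index='document', columns='term', fill_value=0, aggfunc='size')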
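For comparison, the other option piRSquared mentions in the comments (cross-tabulate, then clip the counts at 1) might look roughly like the sketch below. The sample frame is rebuilt from the question, and the name `indicator` is only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "document": ["X", "X", "X", "X", "Y", "Y", "Y"],
    "term": ["a", "b", "a", "c", "a", "c", "d"],
})

# crosstab counts how often each (document, term) pair occurs;
# clip(upper=1) caps every count at 1, giving a 0/1 indicator table
indicator = pd.crosstab(df["document"], df["term"]).clip(upper=1)
print(indicator)
```

Note that this still produces a dense DataFrame, just like the pivot_table version; it only avoids the separate drop_duplicates step.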
pat
  • Sparse = far fewer 1s than 0s. Yours is not. – Patrick Artner Jan 28 '18 at 16:04
  • It is called cross tabulation. You’ll find your answer here https://stackoverflow.com/a/47152692/2336654 – piRSquared Jan 28 '18 at 16:11
  • @piRSquared, in cross tabulation the values are filled with the frequency with which each term occurred, but mine is a bit different: I am attempting to create a sparse matrix of 0's / 1's. – pat Jan 28 '18 at 16:34
  • @Nishal you can either drop duplicates in the original data or clip the results at 1. – piRSquared Jan 28 '18 at 16:35
  • @piRSquared, yes, removing duplicates works and creates my required sparse matrix. – pat Jan 28 '18 at 16:47
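If the result needs to be sparse in the storage sense (which may matter with millions of rows), one possible approach is to build a SciPy CSR matrix directly from integer codes instead of pivoting. This is a sketch that goes beyond what was discussed in the thread: it assumes SciPy is installed, and the names `pairs`, `documents`, `terms` and `matrix` are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Sample data copied from the question
df = pd.DataFrame({
    "document": ["X", "X", "X", "X", "Y", "Y", "Y"],
    "term": ["a", "b", "a", "c", "a", "c", "d"],
})

# Keep one row per (document, term) pair so no cell is summed above 1
pairs = df.drop_duplicates()

# factorize maps each label to an integer code; sort=True keeps labels ordered
row_codes, documents = pd.factorize(pairs["document"], sort=True)
col_codes, terms = pd.factorize(pairs["term"], sort=True)

# Only the non-zero cells are stored: one 1 per unique pair
data = np.ones(len(pairs), dtype=np.int8)
matrix = csr_matrix(
    (data, (row_codes, col_codes)),
    shape=(len(documents), len(terms)),
)

# Row i corresponds to documents[i], column j to terms[j]
print(matrix.toarray())
```

The `documents` and `terms` indexes are what map matrix positions back to labels, so they need to be kept alongside the matrix.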
