
I have a DataFrame which I save/read from a csv file, and I want to create a Term Density Matrix DataFrame from it. Following herrfz's suggestion here, I use CountVectorizer from sklearn. I wrapped that code in a function:

    import numpy as np
    import pandas as pd
    from scipy.sparse import csc_matrix, hstack
    from sklearn.feature_extraction.text import CountVectorizer

    countvec = CountVectorizer()

    def df2tdm(df,titleColumn,placementColumn):
        '''
        Takes in a DataFrame with at least two columns, and returns a dataframe with the term density matrix
        of the words appearing in the titleColumn

        Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
        Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn

        Credits: 
        https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
        '''
        tdm_df = pd.DataFrame(countvec.fit_transform(df[titleColumn]).toarray(), columns=countvec.get_feature_names())
        tdm_df = tdm_df.join(pd.DataFrame(df[placementColumn]))
        return tdm_df

Which returns the TDM as a DataFrame, for example:

    df = pd.DataFrame({'title':['Delicious boiled egg','Fried egg ', 'Potato salad', 'Split orange','Something else'], 'page':[1, 1, 2, 3, 4]})
    print(df.head())
    tdm_df = df2tdm(df,'title','page')
    tdm_df.head()

       boiled  delicious  egg  else  fried  orange  potato  salad  something  \
    0       1          1    1     0      0       0       0      0          0   
    1       0          0    1     0      1       0       0      0          0   
    2       0          0    0     0      0       0       1      1          0   
    3       0          0    0     0      0       1       0      0          0   
    4       0          0    0     1      0       0       0      0          1   

       split  page  
    0      0     1  
    1      0     1  
    2      0     2  
    3      1     3  
    4      0     4  

This implementation suffers from bad memory scaling: when I use a DataFrame whose csv occupies 190 kB saved as UTF-8, the function uses ~200 MB to create the TDM DataFrame. When the csv file is 600 kB, the function uses 700 MB, and when the csv is 3.8 MB the function uses up all of my memory and swap file (8 GB) and crashes.
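For intuition on why this blows up, here is a minimal sketch (assuming the toy df and countvec defined above): the .toarray() call materialises the full n_documents × n_vocabulary array, so memory grows with the product of both dimensions rather than with the csv size.

    # Rough footprint comparison; the exact numbers depend on the corpus
    tm = countvec.fit_transform(df['title'])              # scipy sparse matrix
    n_docs, n_vocab = tm.shape
    dense_bytes = n_docs * n_vocab * tm.dtype.itemsize    # size after .toarray()
    sparse_bytes = tm.data.nbytes + tm.indices.nbytes + tm.indptr.nbytes
    print(dense_bytes, sparse_bytes)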

I also made an implementation using sparse matrices and sparse DataFrames (below), but the memory usage is pretty much the same, only it is considerably slower:

    def df2tdm_sparse(df,titleColumn,placementColumn):
        '''
        Takes in a DataFrame with at least two columns, and returns a dataframe with the term density matrix
        of the words appearing in the titleColumn. This implementation uses sparse DataFrames.

        Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
        Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn

        Credits: 
        https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
        https://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix
        https://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices
        '''
        pm = df[[placementColumn]].values              # placement column as a dense (n, 1) array
        tm = countvec.fit_transform(df[titleColumn])   # term counts as a scipy sparse matrix
        m = csc_matrix(hstack([pm, tm]))               # concatenate and convert to CSC
        dfout = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel()) for i in np.arange(m.shape[0]) ])
        dfout.columns = [placementColumn]+countvec.get_feature_names()
        return dfout

Any suggestions on how to improve memory usage? I wonder if this is related to the memory issues of scikit-learn, e.g. here.

    Your sparse conversion is not doing anything; you need to represent, say, the 0's by NaN first. But the bigger question is why do you need a frame for this? A scipy sparse repr or scikit-learn repr does the job here (and makes it easier to isolate where the memory issue is). – Jeff Mar 06 '14 at 11:06
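Following Jeff's suggestion, a minimal sketch that stays entirely in the scipy sparse representation instead of building a DataFrame (the df2tdm_scipy name and the (matrix, column names) return value are my own assumptions):

    from scipy.sparse import csr_matrix, hstack

    def df2tdm_scipy(df, titleColumn, placementColumn):
        # Hypothetical variant: keep the counts in scipy sparse form and
        # carry the column names separately, never densifying any row
        tm = countvec.fit_transform(df[titleColumn])
        pm = csr_matrix(df[[placementColumn]].values)  # placement as a sparse column
        m = hstack([pm, tm]).tocsr()
        columns = [placementColumn] + countvec.get_feature_names()
        return m, columns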

1 Answer


I also think that the problem might be with the conversion from the sparse matrix to the sparse DataFrame.

Try this function (or something similar):

    def SparseMatrixToSparseDF(xSparseMatrix):
        import numpy as np
        import pandas as pd

        def ElementsToNA(x):
            # Cast to float so zeros can be replaced with NaN,
            # which SparseSeries then stores implicitly
            x = x.astype(float)
            x[x == 0] = np.nan
            return x

        xdf1 = pd.SparseDataFrame(
            [pd.SparseSeries(ElementsToNA(xSparseMatrix[i].toarray().ravel()))
             for i in np.arange(xSparseMatrix.shape[0])])
        return xdf1

You can check that this reduces the stored size by inspecting the density attribute:

    df1.density
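For example, a hedged usage sketch reusing the sparse matrix produced by the question's countvec:

    tm = countvec.fit_transform(df['title'])   # sparse counts from the question
    df1 = SparseMatrixToSparseDF(tm)
    print(df1.density)                         # fraction of stored cells, well below 1.0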

I hope it helps
