
This error is weird and I can't even find anything on Google about it.

I'm attempting to hot encode a column in an existing sparse DataFrame.

combined_cats is a set of all the possible categories, and column_name is a generic column name:

df[column_name] = df[column_name].astype('category', categories=combined_cats,copy=False)

However, this fails with the error in the title. I figured that you can't hot encode a sparse matrix, but I can't seem to convert it back to a dense matrix with to_dense() either, as it says a numpy ndarray has no such method.

I attempted using as_matrix() and resetting the column:

df[column_name] = df[column_name].as_matrix()
df[column_name] = df[column_name].astype('category', categories=combined_cats,copy=False)

Which didn't work either. Is there something I'm doing wrong? The error occurs when I try to use combined_cats.

E.g.:

def hot_encode_column_in_both_datasets(column_name,df,df2,sparse=True):
    col1b = set(df2[column_name].unique())
    col1a = set(df[column_name].unique())
    combined_cats = list(col1a.union(col1b))
    df[column_name] = df[column_name].astype('category', categories=combined_cats,copy=False)
    df2[column_name] = df2[column_name].astype('category', categories=combined_cats,copy=False)

    df = pd.get_dummies(df, columns=[column_name],sparse=sparse)
    df2 = pd.get_dummies(df2, columns=[column_name],sparse=sparse)
    try:
        del df[column_name]
        del df2[column_name]
    except:
        pass
    return df,df2



df = pd.DataFrame({"col1":['a','b','c','d'],"col2":["potato","tomato","potato","tomato"],"col3":[1,1,1,1]})
df2 = pd.DataFrame({"col1":['g','b','q','r'],"col2":["potato","flowers","potato","flowers"],"col3":[1,1,1,1]})

## Hot encode col1
df,df2 = hot_encode_column_in_both_datasets("col1",df,df2)

len(df.columns) #9
len(df2.columns) #9

## Hot encode col2 as well
df,df2 = hot_encode_column_in_both_datasets("col2",df,df2)

Traceback (most recent call last):

  File "<ipython-input-44-d8e27874a25b>", line 1, in <module>
    df,df2 = hot_encode_column_in_both_datasets("col2",df,df2)

  File "<ipython-input-34-5ae1e71bbbd5>", line 331, in hot_encode_column_in_both_datasets
    df[column_name] = df[column_name].astype('category', categories=combined_cats,copy=False)

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2419, in __setitem__
    self._set_item(key, value)

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2485, in _set_item
    value = self._sanitize_column(key, value)

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/sparse/frame.py", line 324, in _sanitize_column
    clean = value.reindex(self.index).as_sparse_array(

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/sparse/series.py", line 573, in reindex
    return self.copy()

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/sparse/series.py", line 555, in copy
    return self._constructor(new_data, sparse_index=self.sp_index,

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 2744, in __getattr__
    return object.__getattribute__(self, name)

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/sparse/series.py", line 242, in sp_index
    return self.block.sp_index

AttributeError: 'CategoricalBlock' object has no attribute 'sp_index'
Wboy
    I don't think we can help you not being able to reproduce this issue. Can you provide a small reproducible data set? Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your post correspondingly. – MaxU - stand with Ukraine May 29 '17 at 17:06
  • @MaxU Understood, added a working example :) thank you! – Wboy May 30 '17 at 01:15
  • I've added an [answer](https://stackoverflow.com/a/44269213/5741205)- please check – MaxU - stand with Ukraine May 30 '17 at 18:25
  • The pandas sparse documentation says: `Any sparse object can be converted back to the standard dense form by calling to_dense:` (Don't confuse the pandas sparse implementation with the scipy one.) – hpaulj May 30 '17 at 19:31
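Tying the comments together: in later pandas versions the separate sparse structures were folded into ordinary Series with a SparseDtype, so the densify-first route the last comment points at looks roughly like the sketch below. Note it uses the current CategoricalDtype API rather than the `astype('category', categories=...)` form from the question, which was later removed:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# A sparse string column, as modern pandas represents it
s = pd.Series(pd.arrays.SparseArray(["potato", "tomato", "potato"]))

# Convert back to dense, then cast with the shared category union
dense = s.sparse.to_dense()
combined_cats = ["flowers", "potato", "tomato"]
cat = dense.astype(CategoricalDtype(categories=combined_cats))

# get_dummies now emits one column per category, including unseen ones
dummies = pd.get_dummies(cat, prefix="col2")
```

This sidesteps the sparse assignment path that raises the CategoricalBlock error, at the cost of densifying the column first.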

1 Answer


As I said before, I would use the CountVectorizer method in this case.

Demo:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=np.union1d(df.col2, df2.col2))

r1 = pd.SparseDataFrame(cv.fit_transform(df.col2), 
                        columns=cv.get_feature_names(),
                        index=df.index, default_fill_value=0)

r2 = pd.SparseDataFrame(cv.fit_transform(df2.col2), 
                        columns=cv.get_feature_names(),
                        index=df2.index, default_fill_value=0)

NOTE: the pd.SparseDataFrame(sparse_array) constructor is a new feature of Pandas 0.20.0, so Pandas 0.20.0+ is needed for this solution.

Result:

In [15]: r1
Out[15]:
   flowers  potato  tomato
0      0.0       1       0
1      0.0       0       1
2      0.0       1       0
3      0.0       0       1

In [16]: r2
Out[16]:
   flowers  potato  tomato
0        0       1     0.0
1        1       0     0.0
2        0       1     0.0
3        1       0     0.0

Pay attention to the memory usage:

In [17]: r1.memory_usage()
Out[17]:
Index      80
flowers     0   # 0 * 8 bytes
potato     16   # 2 * 8 bytes (int64)
tomato     16   # ...
dtype: int64

In [18]: r2.memory_usage()
Out[18]:
Index      80
flowers    16   
potato     16
tomato      0   
dtype: int64
MaxU - stand with Ukraine
  • And then I'll have to concatenate this back into the main array right? Got it, thank you! :) – Wboy Jun 01 '17 at 00:52
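As the comment notes, the encoded block still has to be joined back onto the remaining columns. A minimal sketch of that step, assuming the encoded frame keeps the original index:

```python
import pandas as pd

df = pd.DataFrame({"col2": ["potato", "tomato"], "col3": [1, 1]})

# One-hot encode col2, then splice the dummies back next to the other columns
dummies = pd.get_dummies(df["col2"], prefix="col2")
out = pd.concat([df.drop(columns="col2"), dummies], axis=1)
```

Because both pieces share the same index, pd.concat aligns the rows correctly.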