
Hi, I'm trying to vectorize items that can belong to multiple categories and put them into a pandas DataFrame. I already came up with a solution, but it's very slow. So here's what I'm doing:

This is what my data looks like:

data = {
    'A':['c1','c2','c3'],
    'B':['c4','c5','c2'],
    'C':['c2','c1','c4']
}

I have three items (A-C) that belong to five different categories (c1-c5).

So I create an empty DataFrame, iterate over the items, turn them into boolean Series objects with the right index, and append them:

import numpy as np
import pandas as pd

df = pd.SparseDataFrame()
for k, v in data.items():
    # one boolean Series per item, indexed by its categories
    s = pd.Series(np.ones_like(v, dtype=bool), index=v, name=k)
    df = df.append(s)

My result looks like this:

     c1    c2    c3    c4    c5
A  True  True  True   NaN   NaN
B   NaN  True   NaN  True  True
C  True  True   NaN  True   NaN

I'm happy with this result, but my real data has ~200k categories, which makes this approach horribly slow. Do you have any suggestions on how to speed it up?

Remark: Extracting all categories up front and passing them as columns to the empty DataFrame doesn't help:

df = pd.SparseDataFrame(columns=all_categories)
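(For reference, `all_categories` here is just the union of all category labels; a minimal sketch of how it could be collected, assuming the column order doesn't matter:)

# assumption: all_categories is simply the set of every label across all items
all_categories = sorted({c for v in data.values() for c in v})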
– nadre

2 Answers


You could try:

df = pd.DataFrame.from_dict(data, orient='index')
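For orientation, this intermediate df puts the items in the rows and the list positions in the columns (output reconstructed from the data above):

    0   1   2
A  c1  c2  c3
B  c4  c5  c2
C  c2  c1  c4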

and then

res = df.stack().reset_index().groupby(['level_0', 0]).size().unstack()

Finally, you can cast the output to a SparseDataFrame:

result = pd.SparseDataFrame(res)

With df.stack() you have:

A  0    c1
   1    c2
   2    c3
B  0    c4
   1    c5
   2    c2
C  0    c2
   1    c1
   2    c4

Then you apply reset_index:

  level_0  level_1   0
0       A        0  c1
1       A        1  c2
2       A        2  c3
3       B        0  c4
4       B        1  c5
5       B        2  c2
6       C        0  c2
7       C        1  c1
8       C        2  c4

You could change the column names to be cleaner. The groupby then makes the count:

level_0  0
A        c1    1
         c2    1
         c3    1
B        c2    1
         c4    1
         c5    1
C        c1    1
         c2    1
         c4    1

And finally:

0         c1   c2   c3   c4   c5
level_0
A        1.0  1.0  1.0  NaN  NaN
B        NaN  1.0  NaN  1.0  1.0
C        1.0  1.0  NaN  1.0  NaN
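
As a side note (my addition, not part of this answer), the stack/groupby/unstack chain can also be collapsed with pd.get_dummies:

# one-hot encode the stacked Series, then sum the per-position
# dummies back into one row per item
res = pd.get_dummies(df.stack()).groupby(level=0).sum()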
– Till

Consider the following memory-saving approach:

In [143]: df = pd.DataFrame([' '.join(data[k]) for k in data.keys()],
                            index=data.keys(),
                            columns=['text'])

In [144]: df
Out[144]:
       text
C  c2 c1 c4
A  c1 c2 c3
B  c4 c5 c2

In [145]: from sklearn.feature_extraction.text import CountVectorizer

In [146]: cv = CountVectorizer()

In [147]: df = pd.SparseDataFrame(cv.fit_transform(df['text']),
                                  columns=cv.get_feature_names(),
                                  index=df.index)

In [148]: df
Out[148]:
    c1  c2   c3   c4   c5
C  1.0   1  NaN  1.0  NaN
A  1.0   1  1.0  NaN  NaN
B  NaN   1  NaN  1.0  1.0


In [149]: df.memory_usage()
Out[149]:
Index    80
c1       16
c2       24
c3        8
c4       16
c5        8
dtype: int64
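
One optional refinement (my suggestion, not part of the answer itself): if you want pure membership flags rather than counts, e.g. in case an item ever lists the same category twice, CountVectorizer accepts binary=True:

from sklearn.feature_extraction.text import CountVectorizer

# binary=True caps every count at 1, so a repeated category per item
# still yields a single 0/1 indicator
cv = CountVectorizer(binary=True)
X = cv.fit_transform(df['text'])  # scipy CSR matrix of 0/1 flags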
– MaxU - stand with Ukraine
  • This idea looks awesome; I also thought about joining the categories and using a sklearn vectorizer. Unfortunately I'm facing a strange error when passing the fitted CSR matrix to the SparseDataFrame constructor: File "/home/user/anaconda3/lib/python3.6/site-packages/pandas/sparse/frame.py", line 125, in __init__ generic.NDFrame.__init__(self, mgr) UnboundLocalError: local variable 'mgr' referenced before assignment – nadre Jun 07 '17 at 11:20
  • @nadre, sorry, I forgot to mention the `pd.SparseDataFrame(sparse_array)` constructor was added in Pandas v0.20, so it should work for Pandas 0.20+ – MaxU - stand with Ukraine Jun 07 '17 at 11:25
  • @nadre, if you can't update your Pandas version, you can use [the following approach](https://stackoverflow.com/questions/41916560/pandas-dataframe-memory-python/41916672#41916672) – MaxU - stand with Ukraine Jun 07 '17 at 11:27
  • Yes. That works, thank you! But constructing a SparseDataFrame from a CSR Matrix still takes a lot of time. :( – nadre Jun 07 '17 at 11:55