
For each row id, I have a list of values stored in a pandas column. The structure is as follows:

import pandas as pd

df = {'id1':[['a','b','c','d']],'id2':[['a','d','e','j']],'id3':[['b','d','i','q']]}
df = pd.DataFrame.from_dict(df,orient='index')

which gives me:

                0
id1  [a, b, c, d]
id2  [a, d, e, j]
id3  [b, d, i, q]

First, I created a separate set of the unique values, using this code:

l = df[0].tolist()  # column 0 holds one list per id
flat_set = {item for sublist in l for item in sublist}

In the end, I need to get a sparse version of this:

(image: the desired table, with one row per id and one column per unique value)

Notes:

  1. Number of unique values in the set: ~100K
  2. Number of ids: ~60K

I don't mind keeping a dict on the side if shortening the column names reduces memory, but unpacking the lists into a sparse structure is the hard part for me.

Please help :)

Talis

1 Answer


Use MultiLabelBinarizer with the DataFrame constructor:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df[0]),columns=mlb.classes_, index=df.index)
print(df)
     a  b  c  d  e  i  j  q
id1  1  1  1  1  0  0  0  0
id2  1  0  0  1  1  0  1  0
id3  0  1  0  1  0  1  0  1
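
At the sizes mentioned in the question, a dense result like the one above is likely too large to hold in memory. A rough back-of-the-envelope check, assuming about 60K rows and 100K columns (the helper name dense_bytes is only illustrative):

```python
def dense_bytes(n_rows, n_cols, itemsize):
    """Bytes needed for a dense n_rows x n_cols array with the given cell size."""
    return n_rows * n_cols * itemsize

n_ids = 60_000        # approximate number of ids, from the question
n_values = 100_000    # approximate number of unique values, from the question

# Compare a few integer cell sizes (bytes per cell).
for dtype_name, itemsize in (("int8", 1), ("int32", 4), ("int64", 8)):
    gb = dense_bytes(n_ids, n_values, itemsize) / 1e9
    print(f"{dtype_name}: about {gb:.0f} GB")
```

Since each id holds only a handful of values, almost all of those cells would be zeros, which is why a sparse representation is worth using here.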

EDIT: For a sparse DataFrame, add sparse_output=True to MultiLabelBinarizer and use DataFrame.sparse.from_spmatrix:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
a = mlb.fit_transform(df[0])

df = pd.DataFrame.sparse.from_spmatrix(a, columns=mlb.classes_, index=df.index)
print(df)
     a  b  c  d  e  i  j  q
id1  1  1  1  1  0  0  0  0
id2  1  0  0  1  1  0  1  0
id3  0  1  0  1  0  1  0  1

print(df.dtypes)
a    Sparse[int32, 0]
b    Sparse[int32, 0]
c    Sparse[int32, 0]
d    Sparse[int32, 0]
e    Sparse[int32, 0]
i    Sparse[int32, 0]
j    Sparse[int32, 0]
q    Sparse[int32, 0]
dtype: object
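
For reference, here is a self-contained sketch of the sparse approach on the toy data from the question, with a quick check of how sparse the result is. The names sparse_df and matrix are only illustrative; the sketch assumes a pandas version that provides the DataFrame.sparse accessor, plus scikit-learn and SciPy.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data from the question: one list of values per id, stored in column 0.
data = {'id1': [['a', 'b', 'c', 'd']],
        'id2': [['a', 'd', 'e', 'j']],
        'id3': [['b', 'd', 'i', 'q']]}
df = pd.DataFrame.from_dict(data, orient='index')

# sparse_output=True makes fit_transform return a SciPy sparse matrix.
mlb = MultiLabelBinarizer(sparse_output=True)
a = mlb.fit_transform(df[0])

# Wrap the sparse matrix in a DataFrame without densifying it.
sparse_df = pd.DataFrame.sparse.from_spmatrix(a, columns=mlb.classes_, index=df.index)

# Fraction of cells that are stored explicitly (non-zero here).
print(sparse_df.sparse.density)

# Convert back to a SciPy COO matrix if a library expects one.
matrix = sparse_df.sparse.to_coo()
print(matrix.shape, matrix.nnz)
```

On real data with ~100K columns the density should be far lower than in this toy example, and the SciPy matrix `a` can also be used directly if the DataFrame wrapper is not needed.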
jezrael