How to turn a column of lists in pandas to a sparse DataFrame of the unique values in Python

Question

For each row id, I have a list of values stored in a pandas column. The structure is as follows:

import pandas as pd

df = {'id1':[['a','b','c','d']],'id2':[['a','d','e','j']],'id3':[['b','d','i','q']]}
df = pd.DataFrame.from_dict(df,orient='index')

which gives me:

                0
id1  [a, b, c, d]
id2  [a, d, e, j]
id3  [b, d, i, q]

First, I created a set of the unique values on the side, using this code:

l = df[0].tolist()  # one list of values per id
flat_set = {item for sublist in l for item in sublist}

In the end, I need to get a sparse version of this:

Notes:

Number of unique values in the set: ~100K
Number of ids: ~60K

I don't mind keeping a dict on the side if shortening the column names reduces memory, but unpacking from lists to a sparse structure is the hard part for me.
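To put the sizes in the notes into perspective, here is a rough back-of-the-envelope sketch. The figure of 4 values per id is an assumption taken from the three-row example above, so the real numbers may differ.

```python
# Rough memory arithmetic for the sizes quoted in the notes.
n_ids = 60_000        # ~60K ids (from the notes)
n_unique = 100_000    # ~100K unique values (from the notes)
values_per_id = 4     # ASSUMPTION: taken from the small example, not from real data

dense_cells = n_ids * n_unique            # every id x value cell is stored
dense_gb_int8 = dense_cells * 1 / 1e9     # 1 byte per cell
dense_gb_int64 = dense_cells * 8 / 1e9    # 8 bytes per cell
nonzeros = n_ids * values_per_id          # cells that would actually hold a 1
density = nonzeros / dense_cells          # fraction of non-zero cells

print(f"dense cells:    {dense_cells:,}")
print(f"dense as int8:  {dense_gb_int8:.0f} GB")
print(f"dense as int64: {dense_gb_int64:.0f} GB")
print(f"non-zero cells: {nonzeros:,} (density {density:.6f})")
```

Under that assumption only a tiny fraction of the cells is non-zero, which is why a sparse representation is the natural fit here.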

Please help :)

jezrael · Answer 1 · 2019-12-09T08:59:29.310

5

Use MultiLabelBinarizer with the DataFrame constructor:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df[0]),columns=mlb.classes_, index=df.index)
print (df)
     a  b  c  d  e  i  j  q
id1  1  1  1  1  0  0  0  0
id2  1  0  0  1  1  0  1  0
id3  0  1  0  1  0  1  0  1

EDIT: For a sparse DataFrame, pass sparse_output=True to MultiLabelBinarizer and use DataFrame.sparse.from_spmatrix:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
a = mlb.fit_transform(df[0])

df = pd.DataFrame.sparse.from_spmatrix(a, columns=mlb.classes_, index=df.index)
print (df)
     a  b  c  d  e  i  j  q
id1  1  1  1  1  0  0  0  0
id2  1  0  0  1  1  0  1  0
id3  0  1  0  1  0  1  0  1

print (df.dtypes)
a    Sparse[int32, 0]
b    Sparse[int32, 0]
c    Sparse[int32, 0]
d    Sparse[int32, 0]
e    Sparse[int32, 0]
i    Sparse[int32, 0]
j    Sparse[int32, 0]
q    Sparse[int32, 0]
dtype: object
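As a quick sanity check that the result really is stored sparsely, the sketch below rebuilds the small example and inspects it through the pandas `.sparse` accessor. The names `raw`, `sparse_df` and `coo` are illustrative only and do not come from the answer.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Rebuild the example frame from the question (single column named 0).
raw = pd.DataFrame.from_dict(
    {'id1': [['a', 'b', 'c', 'd']],
     'id2': [['a', 'd', 'e', 'j']],
     'id3': [['b', 'd', 'i', 'q']]},
    orient='index',
)

mlb = MultiLabelBinarizer(sparse_output=True)
matrix = mlb.fit_transform(raw[0])  # SciPy sparse matrix, one row per id

sparse_df = pd.DataFrame.sparse.from_spmatrix(
    matrix, columns=mlb.classes_, index=raw.index
)

# Fraction of cells that are explicitly stored (non-fill values).
density = sparse_df.sparse.density
# Convert back to a SciPy COO matrix, e.g. to hand over to other libraries.
coo = sparse_df.sparse.to_coo()

print(density)
print(coo.shape, coo.nnz)
```

On the real data the density should be far lower than in this toy example, since each id holds only a handful of the roughly 100K possible values.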

edited Dec 09 '19 at 08:59

answered Dec 09 '19 at 08:38


What do you mean by df[0]? – Talis Dec 09 '19 at 08:46

@Talis - it is the column named `0`; you may need `df['package']` instead – jezrael Dec 09 '19 at 08:47
That worked, but can I force it to be sparse when declaring df? Here: df = pd.DataFrame(mlb.fit_transform(df[0]),columns=mlb.classes_, index=df.index) – Talis Dec 09 '19 at 08:48
@Talis - answer was edited. – jezrael Dec 09 '19 at 09:00
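On the `df[0]` question raised in the comments: `from_dict` with `orient='index'` labels the resulting column with the integer `0` unless column names are supplied. A minimal sketch, assuming the column should be called `package` (the name suggested in the comment, otherwise hypothetical):

```python
import pandas as pd

data = {'id1': [['a', 'b', 'c', 'd']],
        'id2': [['a', 'd', 'e', 'j']],
        'id3': [['b', 'd', 'i', 'q']]}

# Without column names, the single column is labelled with the integer 0.
default_df = pd.DataFrame.from_dict(data, orient='index')
print(default_df.columns.tolist())

# ASSUMPTION: 'package' is just an illustrative column name.
named_df = pd.DataFrame.from_dict(data, orient='index', columns=['package'])
print(named_df.columns.tolist())
print(named_df['package'].tolist())
```

With a named column, the calls in the answer would take `df['package']` in place of `df[0]`.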
