Hi, I'm trying to vectorize items that can belong to multiple categories and store the result in a pandas DataFrame. I already have a working solution, but it's very slow. Here's what I'm doing:
This is what my data looks like:
data = {
    'A': ['c1', 'c2', 'c3'],
    'B': ['c4', 'c5', 'c2'],
    'C': ['c2', 'c1', 'c4'],
}
I have three items (A-C) that belong to five different categories (c1-c5).
So I create an empty DataFrame, iterate over the items, turn each one into a boolean Series with its categories as the index, and append it:
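import numpy as np
import pandas as pd

df = pd.SparseDataFrame()
for k, v in data.items():
    # One row per item: True for every category the item belongs to
    s = pd.Series(np.ones_like(v, dtype=bool), index=v, name=k)
    df = df.append(s)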
My result looks like this:
I'm happy with this result, but my real data has ~200k categories, which makes this approach horribly slow. Do you have any suggestions for how to speed it up?
Remark: Extracting all categories up front and passing them as columns to the empty DataFrame doesn't help:
df = pd.SparseDataFrame(columns=all_categories)
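For illustration, all_categories could be the sorted union of every item's category list. The construction is not shown above, so the sketch below is an assumption about how such a list might be built from the sample data, not necessarily how it was done:

```python
# Sample data from the question.
data = {
    'A': ['c1', 'c2', 'c3'],
    'B': ['c4', 'c5', 'c2'],
    'C': ['c2', 'c1', 'c4'],
}

# Assumed construction of all_categories: collect every category that
# appears in any item, then sort so the column order is stable.
all_categories = sorted({c for cats in data.values() for c in cats})

print(all_categories)  # ['c1', 'c2', 'c3', 'c4', 'c5']
```

Sorting is optional; it only makes the column order deterministic.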