For nested data like this, you may get better performance by wrangling the data outside Pandas and only then building the DataFrame. Of course, this only matters if you care about performance (there is no need for premature optimisation), and timing it on your own data is the only way to confirm that it is actually faster:
from collections import defaultdict
d = defaultdict(int)
for words, number in zip(df.words, df.category):
    for word in words:
        d[(word, number)] += 1
d
defaultdict(int,
            {('cat', 1): 3,
             ('dog', 1): 2,
             ('mouse', 1): 1,
             ('mouse', 2): 1,
             ('cat', 2): 1,
             ('dog', 2): 1,
             ('elephant', 2): 1,
             ('elephant', 3): 2})
Build the DataFrame:
(pd.DataFrame(d.values(), index = d)
.unstack(fill_value = 0)
.droplevel(0, axis = 1)
)
          1  2  3
cat       3  1  0
dog       2  1  0
elephant  0  1  2
mouse     1  1  0
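For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. The sample `df` is an assumption: it is made-up data chosen only so that the counts match the output shown above, not the data from the question. It also builds the frame with `pd.Series(counts).unstack(...)` rather than the `pd.DataFrame(...).droplevel(...)` chain, which is a small variation that should give the same table.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample data (an assumption, picked to reproduce the counts above).
df = pd.DataFrame({
    "words": [
        ["cat", "dog", "dog"],
        ["cat", "cat", "mouse"],
        ["mouse", "cat", "dog", "elephant"],
        ["elephant", "elephant"],
    ],
    "category": [1, 1, 2, 3],
})

# Count every (word, category) pair in plain Python.
counts = defaultdict(int)
for words, number in zip(df["words"], df["category"]):
    for word in words:
        counts[(word, number)] += 1

# Tuple keys become a two-level index; unstack moves the category level
# into the columns and fills missing pairs with 0.
result = pd.Series(counts).unstack(fill_value=0)
print(result)
```

Because the keys are `(word, category)` tuples, the Series already carries both levels, so no `droplevel` is needed in this variant.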
Taking a cue from @HenryEcker, you could also use Counter from the collections module:
from itertools import product, chain
from collections import Counter
# integers are put into a list as `product` works on iterables
pairing = (product(left, [right])
           for left, right
           in zip(df.words, df.category))
outcome = Counter(chain.from_iterable(pairing))
outcome
Counter({('cat', 1): 3,
         ('dog', 1): 2,
         ('mouse', 1): 1,
         ('mouse', 2): 1,
         ('cat', 2): 1,
         ('dog', 2): 1,
         ('elephant', 2): 1,
         ('elephant', 3): 2})
Build the DataFrame as before:
(pd.DataFrame(outcome.values(), index = outcome)
.unstack(fill_value = 0)
.droplevel(0, axis = 1)
)
          1  2  3
cat       3  1  0
dog       2  1  0
elephant  0  1  2
mouse     1  1  0
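Since the performance claim at the top needs testing, here is one rough way to check it with `timeit`. Everything in this sketch is an assumption for illustration: the sample data is made up, the function names `outside_pandas` and `inside_pandas` are hypothetical, and the pure-Pandas baseline (`explode` followed by `pd.crosstab`) is just one plausible alternative to compare against. Timings will vary with data size and Pandas version, so treat the numbers as indicative only.

```python
import timeit
from collections import Counter
from itertools import chain, product

import pandas as pd

# Hypothetical sample data, repeated so the timings are not dominated by overhead.
df = pd.DataFrame({
    "words": [
        ["cat", "dog", "dog"],
        ["cat", "cat", "mouse"],
        ["mouse", "cat", "dog", "elephant"],
        ["elephant", "elephant"],
    ],
    "category": [1, 1, 2, 3],
})
big = pd.concat([df] * 1000, ignore_index=True)


def outside_pandas(frame):
    # Count (word, category) pairs in plain Python, then build the table.
    pairing = (product(left, [right])
               for left, right in zip(frame["words"], frame["category"]))
    counts = Counter(chain.from_iterable(pairing))
    return pd.Series(counts).unstack(fill_value=0)


def inside_pandas(frame):
    # Pure-Pandas baseline: one row per word, then a cross-tabulation.
    exploded = frame.explode("words").reset_index(drop=True)
    return pd.crosstab(exploded["words"], exploded["category"])


a = outside_pandas(big)
b = inside_pandas(big)
# Both approaches should produce the same counts.
same = (a.to_numpy() == b.to_numpy()).all()

for fn in (outside_pandas, inside_pandas):
    seconds = timeit.timeit(lambda: fn(big), number=5)
    print(f"{fn.__name__}: {seconds:.3f}s for 5 runs")
```

Checking that both functions agree before timing them guards against comparing a fast but wrong implementation with a slow but correct one.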