I have a data frame that takes this form:
import pandas as pd
dict = {'id':["1001", "1001", "1001", "1002", "1002", "1002", "1003", "1003", "1003"],
'food': ["apple", "ham", "egg", "apple", "pear", "cherry", "cheese", "milk", "cereal"],
'fruit':[1, 0, 0, 1, 1, 1, 0, 0, 0],
'score':[1, 3, 1, 1, 1, 1, 2, 2, 3]}
df = pd.DataFrame(dict)
id food fruit score
0 1001 apple 1 1
1 1001 ham 0 0
2 1001 egg 0 0
3 1002 apple 1 1
4 1002 pear 1 2
5 1002 cherry 1 3
6 1003 cheese 0 0
7 1003 cherry 1 3
8 1003 cheese 0 0
I would like to create a new data frame that has one row for a single participant (i.e., same id) and then columns for custom summaries of the data, e.g.:
- number of unique foods
- number of total fruits
- total score
- etc.
Example output:
id unique fruits score
0 1001 3 1 1
1 1002 3 3 6
2 1003 2 1 3
I could create a new empty data frame and then iterate over the unique id's in the old data frame, using logical indexing to fill the columns. But my data frame has about 50x10^6 rows and ~200,000 unique id's so this would take extremely long. I have read that iterating over the rows of a data frame is inefficient, but I don't know how to apply alternative solutions to my dataset.
Thanks.