I am doing the "What's Cooking?" Kaggle challenge and I have a dataframe which has three columns (note that the last one contains a list):
cuisine id ingredients
0 greek 10259 [romaine lettuce, black olives, grape tomatoes...
1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g...
2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g...
I wanted to calculate how often every ingredient is used in every cuisine. In order to do it, I splitted the last column and used it as a row in a new dataset. Additionally, I added the cuisine to every element of the splitted row. So if there was a row:
cuisine_1 id [ingredient_1, ingredient_2, ingredient_3]
the expected output which I wanted to get was:
cuisine_1 id ingredient_1
cuisine_1 id ingredient_2
cuisine_1 id ingredient_3
I wrote the following code which works correctly and transforms the dataset to the expected format:
ingredients = []
cuisines = []
ids = []
for _, row in train.iterrows():
cuisine = row.cuisine
identifier = row.id
for ingredient in row.ingredients:
cuisines.append(cuisine)
ingredients.append(ingredient)
ids.append(identifier)
ingredient_to_cuisine = pd.DataFrame({
"id": ids,
"ingredient": ingredients,
"cuisine": cuisines
})
It gets the job done, but that is a lot of code. Also, I had to dissect the dataset and rebuild it. It does not look like Pandas code. I feel I am reinventing the wheel.
My question is: Are there any built-in Pandas functions which I should use to get such results in a more "Pandas-friendly way"?