1

I am doing the "What's Cooking?" Kaggle challenge and I have a dataframe which has three columns (note that the last one contains a list):

    cuisine id  ingredients
    0   greek   10259   [romaine lettuce, black olives, grape tomatoes...
    1   southern_us 25693   [plain flour, ground pepper, salt, tomatoes, g...
    2   filipino    20130   [eggs, pepper, salt, mayonaise, cooking oil, g...

I wanted to calculate how often every ingredient is used in every cuisine. In order to do it, I splitted the last column and used it as a row in a new dataset. Additionally, I added the cuisine to every element of the splitted row. So if there was a row:

cuisine_1    id   [ingredient_1, ingredient_2, ingredient_3]

the expected output which I wanted to get was:

cuisine_1    id   ingredient_1
cuisine_1    id   ingredient_2
cuisine_1    id   ingredient_3

I wrote the following code which works correctly and transforms the dataset to the expected format:

ingredients = []
cuisines = []
ids = []
for _, row in train.iterrows():
    cuisine = row.cuisine
    identifier = row.id
    for ingredient in row.ingredients:
        cuisines.append(cuisine)
        ingredients.append(ingredient)
        ids.append(identifier)

ingredient_to_cuisine = pd.DataFrame({
    "id": ids,
    "ingredient": ingredients,
    "cuisine": cuisines
})

It gets the job done, but that is a lot of code. Also, I had to dissect the dataset and rebuild it. It does not look like Pandas code. I feel I am reinventing the wheel.

My question is: Are there any built-in Pandas functions which I should use to get such results in a more "Pandas-friendly way"?

Bartosz Mikulski
  • 525
  • 3
  • 17

0 Answers0