Group by from wide form in Pandas

Question

I have a DataFrame like this one:

customer_type   age    satisfaction    design  food    wifi    service    distance
        Loyal    28   Not Satisfied         0    1        2          2        13.5
        Loyal    55       Satisfied         5    3        5          4        34.2
     Disloyal    36   Not Satisfied         2    0        2          4        55.8
     Disloyal    28   Not Satisfied         3    1        2          2        13.5
     Disloyal    33   Not Satisfied         2    1        2          2        13.5
     Disloyal    35   Not Satisfied         2    1        2          2        13.5
     Disloyal    39   Not Satisfied         1    1        2          2        13.5
     Disloyal    31   Not Satisfied         2    1        2          2        13.5
        Loyal    28   Not Satisfied         0    1        2          2        13.5
     Disloyal    31   Not Satisfied         2    1        2          2        13.5
     Disloyal    40   Not Satisfied         2    1        2          2        13.5
     Disloyal    35   Not Satisfied         2    1        2          2        13.5
     Disloyal    35   Not Satisfied         2    2        2          2        13.5

I want to find out the characteristics of the Disloyal and Not Satisfied customers that are between 30 and 40 years old, grouping them by the service they have rated:

 service   ratings_count    age   age_count   population_pct
  design               8    40    1           7.69
                            36    1           7.69
                            35    3           23.07
                            33    1           7.69
                            31    2           15.38
    food               1    35    1           7.69

I suspect I have to use melt but I can't figure out how to groupby from there.

If possible, I'd like to be able to know where I can customize the groupby by adding columns. For example, if I wanted to include the `Distance` of those customers as well. That's just a bonus though. Don't worry if it's too much trouble. — Edison, Jun 16 '22 at 08:28

Laurent · Accepted Answer · 2022-06-20T14:14:19.340

With the following toy dataframe, inspired by yours but a bit more heterogeneous:

import pandas as pd

df = pd.DataFrame(
    {
        "customer_type": [
            "Loyal",
            "Loyal",
            "Disloyal",
            "Disloyal",
            "Loyal",
            "Disloyal",
            "Disloyal",
            "Disloyal",
            "Loyal",
            "Disloyal",
            "Disloyal",
            "Loyal",
            "Disloyal",
        ],
        "age": [28, 55, 27, 31, 42, 35, 39, 31, 28, 51, 40, 35, 35],
        "satisfaction": [
            "Not Satisfied",
            "Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Satisfied",
        ],
        "design": [0, 5, 2, 4, 2, 2, 1, 0, 1, 2, 0, 2, 1],
        "food": [1, 3, 0, 0, 1, 0, 1, 0, 1, 0, 1, 3, 2],
        "wifi": [2, 5, 2, 1, 2, 2, 0, 0, 1, 2, 2, 0, 2],
        "service": [2, 4, 1, 2, 2, 2, 0, 2, 3, 0, 1, 2, 2],
        "distance": [
            13.5,
            34.2,
            55.8,
            21.2,
            23.5,
            13.5,
            9.0,
            13.5,
            13.5,
            22.7,
            13.5,
            13.5,
            18.4,
        ],
    }
)

Here is one less subtel, although easier, way to do it:

# Setup
amenities = ["design", "food", "wifi", "service"]
dfs = []
new_df = df[
    df["age"].between(30, 40)
    & (df["customer_type"] == "Disloyal")
    & (df["satisfaction"] == "Not Satisfied")
]

# Iterate on each amenity, groupby age and add other columns/values
for amenity in amenities:
    temp_df = (
        new_df[new_df[amenity] > 0]
        .groupby("age")
        .agg({amenity: "count", "distance": "mean"})
        .reset_index(drop=False)
        .rename(columns={amenity: "age_count", "distance": "distance_mean"})
        .pipe(
            lambda df_: df_.assign(
                population_pct=(100 * df_["age_count"] / df_["age_count"].sum()).round(
                    2
                )
            )
        )
    )
    temp_df.loc[0, "amenities"] = amenity
    temp_df.loc[0, "ratings_count"] = temp_df["age_count"].sum()
    temp_df = pd.concat(
        [
            temp_df,
            new_df[(new_df[amenity] != 0)]
            .sort_values(by=["age"])
            .groupby("age")
            .agg({amenity: list})
            .reset_index(drop=True),
        ],
        axis=1,
    )
    temp_df = temp_df.rename(columns={amenity: "ratings"})
    temp_df["ratings"] = temp_df["ratings"].apply(lambda x: x[0] if len(x) == 1 else x)
    dfs.append(temp_df)

# Get final dataframe and cleanup
new_df = (
    pd.concat(dfs)
    .fillna(method="ffill")
    .reindex(
        columns=[
            "amenities",
            "ratings_count",
            "age",
            "age_count",
            "ratings",
            "distance_mean",
            "population_pct",
        ]
    )
    .astype({"ratings_count": "int"})
    .set_index(["amenities", "ratings_count"])
)

So that:

print(new_df)
# Output
                         age  age_count ratings  distance_mean  population_pct
amenities ratings_count                                                       
design    3               31          1       4          21.20           33.33
          3               35          1       2          13.50           33.33
          3               39          1       1           9.00           33.33
food      2               39          1       1           9.00           50.00
          2               40          1       1          13.50           50.00
wifi      3               31          1       1          21.20           33.33
          3               35          1       2          13.50           33.33
          3               40          1       2          13.50           33.33
service   4               31          2  [2, 2]          17.35           50.00
          4               35          1       2          13.50           25.00
          4               40          1       1          13.50           25.00

From here, you can filter ratings_count <=2 like this:

print(new_df[new_df.index.get_level_values(1) <= 2])
# Output
                         age  age_count ratings  distance_mean  population_pct
amenities ratings_count                                                       
food      2               39          1       1            9.0            50.0
          2               40          1       1           13.5            50.0

The groupby is great. Thank you. However, the `get_level_values` query only returns the column names, no data. Importantly, we have the counts, but it would be great if rankings/ratings were actually in the [output](https://stackoverflow.com/questions/72629849/sum-groupby-column-values-returned-by-value-counts/72629933#72629933). `ratings | rating_counts | age | age_counts`. So we'd have the main output plus the additional query. — Edison, Jun 20 '22 at 12:35
Upvoted for the great answer. Appreciate it. Never knew about the `get_level_values` method. Not sure why I don't get anything returned with that. Btw, what's your recommendation? What would you have done? Start with zero filters and query who ranked which products between e.g. `1 and 3`. That gives us the age range, amenities and number of ratings per age which is the most useful info. In that case, I simply remove the (30,40) age filter. Btw this is a Likert Scale so `0 = "does not apply"`. e.g. maybe the customer did not order food etc. — Edison, Jun 21 '22 at 09:23
I guess in a real-case, there would be no filters and it would be the query that tells me all the attributes of those who ranked what products between 1-3. Then it's the query that tells me if they were loyal, old, satisfied etc instead of me setting a filter. I'll try that out on my own using your template for practice. — Edison, Jun 21 '22 at 09:34
Btw, I ran only `[new_df.index.get_level_values(1) <= 2]` to test and all the values were `false`. I guess that's why I'm not getting anything back using `new_df[new_df.index.get_level_values(1) <= 2]` which is weird because I'm looking right at values less than 2. — Edison, Jun 21 '22 at 09:50
It looks like [this](https://i.imgur.com/H8Fot56.png). Abstracted out for brevity but the main code block executes so it shouldn't be an issue. Not sure why `[new_df.index.get_level_values(1) <= 2]` returned `false, false, false, false`. — Edison, Jun 21 '22 at 12:24
Ok, then perhaps the output of `print(new_df.index.get_level_values(1).unique())` will be more helpful? — Laurent, Jun 21 '22 at 12:45
When I do that I get `Int64Index([8828, 7686, 8701, 8825, 8719], dtype='int64', name='ratings_count')`. — Edison, Jun 22 '22 at 02:47
So, this means that `new_df.index.get_level_values(1)` can only return one of those five values, since they are the only ones that can be found in the second level of the index. As none of them is less or equal than 2, the output of [new_df.index.get_level_values(1) <= 2] can only be [false, false, false, false], and thus slincing with this array returns an empty dataframe. Cheers. — Laurent, Jun 22 '22 at 12:16

Group by from wide form in Pandas

1 Answers1

Linked