Pandas - how to add multiple conditional columns to dataframe?

Question

So what I'm trying to do is create several calculations based on conditions and add them to a dataframe. Here is some code as an example:

data = pd.DataFrame({'Project':  ['Project 1', 'Project 2', ' Project 3', 'Project 4', 'Project 5', 'Project 6', 'Project 7', 'Project 8', 'Project 9', 'Project 10'],
        'Date': ['10/1/2020', '10/1/2020', '11/1/2020', '12/1/2020', '12/1/2020', '12/1/2020', '2/1/2021', '2/1/2021', '3/1/2021', '4/1/2021'],
       'Team': ['Team 1', 'Team 2', 'Team 3', 'Team 2', 'Team 2', 'Team 2', 'Team 1', 'Team 1', 'Team 1', 'Team 1'],
       'Timely': ['Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes']})
                                       
df1=data.groupby('Team')['Timely'].apply(lambda x: (x=='Yes').sum()).reset_index(name='Yes Count')       

df1

Running this code will give me a table that shows the count of projects with timely = Yes for each team. Say I wanted to add a second column to the table to show the total count where timely = no and a third and fourth column to show the percent yes and percent no for each team. How would I do that?

Does this answer your question? [Apply multiple functions to multiple groupby columns](https://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns) — iff_or, Sep 28 '20 at 23:06
Almost. I tried using the function in the apply example on that page: ```def f(x): d = {} d['Timely'] = data['Timely'].apply(lambda x: (x=='Yes').sum()) d['Untimely'] = data['Timely'].apply(lambda x: (x=='No').sum()) return pd.Series(d, index=['Timely', 'Untimely']) df.groupby('Team').apply(f)``` but got the following error: ```'bool' object has no attribute 'sum'``` — hockeyhorror, Sep 28 '20 at 23:52

Cameron Riddell · Accepted Answer · 2020-10-01T16:08:20.850

Seems to me like you just want a crosstab. It's more straightforward to calculate using pd.crosstab than it is using a groupby/apply approach. Though pd.crosstab isn't the most efficient method of doing this, it works just fine.

This first snippet gets us the counts of "yes" and "no" for each team. We also get a "Total" column that is the sum of the "yes" + "no":

xtab = pd.crosstab(data["Team"], data["Timely"], margins=True, margins_name="Total")

print(xtab)
Timely  No  Yes  Total
Team                  
Team 1   1    4      5
Team 2   2    2      4
Team 3   1    0      1
Total    4    6     10

Now let's calculate the percentage breakdown of "No" and "Yes" by dividing those both by our "Total" column:

norm_xtab = xtab[["No", "Yes"]].div(xtab["Total"], axis=0)

print(norm_xtab)
Timely   No  Yes
Team            
Team 1  0.2  0.8
Team 2  0.5  0.5
Team 3  1.0  0.0
Total   0.4  0.6

Then we can combine those two dataframes horizontally with the pd.concat function. We can also give names to each "subdataframe" e.g. one chunk of the dataframe will only be for count data, the other chunk will be for the normalized/percentage based data:

out = pd.concat([xtab, norm_xtab], keys=["count", "percent"], axis=1)
out = out.rename(columns={"No": "untimely", "Yes": "timely"}, level=1)


print(out)
          count               percent       
Timely untimely timely Total untimely timely
Team                                        
Team 1        1      4     5      0.2    0.8
Team 2        2      2     4      0.5    0.5
Team 3        1      0     1      1.0    0.0
Total         4      6    10      0.4    0.6

The result of the concatenation is a DataFrame with a multiindex as the columns. If you are unfamiliar with pd.MultiIndex, you can use this snippet to flatten your result into something you may be more familiar with:

out.columns = ["{}_{}".format(*colname) for colname in out.columns]

print(out)

        count_untimely  count_timely  count_Total  percent_untimely  percent_timely
Team                                                                               
Team 1               1             4            5               0.2             0.8
Team 2               2             2            4               0.5             0.5
Team 3               1             0            1               1.0             0.0
Total                4             6           10               0.4             0.6

To change the column names as in your comment below, you can simply run rename one more time:

out = out.rename(columns=lambda colname: colname.replace("count_", ""))

print(out)
        untimely  timely  Total  percent_untimely  percent_timely
Team                                                             
Team 1         1       4      5               0.2             0.8
Team 2         2       2      4               0.5             0.5
Team 3         1       0      1               1.0             0.0
Total          4       6     10               0.4             0.6

To convert the decimals to whole percentages, you'll need to multiply by 100, then either round to 0 decimal places OR use string formatting to trim the trailing decimals (I'll show you how to do both), and add another string formatting to get the "%" to appear.

percentages = (out
               .filter(like="percent")      # select columns that contain the string "percent"
               .mul(100)                    # Multiply everything by 100
               .round(0)                    # If you don't want to perform any rounding, get rid of this line
               .applymap("{:.0f}%".format)) # Apply string formatting to trim the decimal.

print(percentages)
       percent_untimely percent_timely
Team                                  
Team 1              20%            80%
Team 2              50%            50%
Team 3             100%             0%
Total               40%            60%

If you want these values reflect back in our out dataframe:

out.loc[:, percentages.columns] = percentages

print(out)
        untimely  timely  Total percent_untimely percent_timely
Team                                                           
Team 1         1       4      5              20%            80%
Team 2         2       2      4              50%            50%
Team 3         1       0      1             100%             0%
Total          4       6     10              40%            60%

So when I try to create the second dataframe on my real dataset I get the following error: ```Unable to coerce to DataFrame, shape must be (8, 2): given (8, 1)``` Does this mean there are null values in the data set? And is there a limit on how many dataframes I can concatenate? There are two other metrics I need to include in this dataframe. — hockeyhorror, Sep 28 '20 at 23:16
Can you show me the code that leads to the error in your case? I'm not certain which piece of code you're referring to — Cameron Riddell, Sep 29 '20 at 08:20
I can't use the code from my real data set, but I get the same error with your code with this dataframe. The first xtab code works. It's the second norm_xtab code : ```norm_xtab = xtab[["No", "Yes"]] / xtab[["Total"]].values print(norm_xtab)``` that returns the following error: ```Unable to coerce to DataFrame, shape must be (4, 2): given (4, 1)``` — hockeyhorror, Sep 29 '20 at 13:55
Try this instead, my pandas version is `1.1.1` so a version difference might result in an error, this approach should be a little more robust. `norm_xtab = xtab[["No", "Yes"]].div(xtab["Total"], axis=0) — Cameron Riddell, Sep 29 '20 at 16:50
I have one final question. Is there a straightforward way to rename the columns in the combined crosstab and reorder the columns? So for example, instead of it saying count_no, count_yes, it would instead say untimely and timely? — hockeyhorror, Sep 29 '20 at 21:11
Yep! Just updated my answer. You can use the `rename(columns=...)` method to do this. Note that I used `rename` twice, once right after using `pd.concat` and once at the end to strip away the "count_" prefix. — Cameron Riddell, Sep 29 '20 at 21:43
You can concatenate any number of dataframes together, all you need is a list of dataframes. If you use `axis=1` it will stick them together horizontally, and `axis=0` sticks them together vertically. — Cameron Riddell, Sep 30 '20 at 17:05
Cool. Also, in your example I would like to convert the percentage to a whole number (ie., 20%"). I used the following code to do that: ```YOY = YOY.round(decimals=2) * 100 YOY = YOY.astype(str) + "%"```. But that results in trailing numbers after the decimal. How can I remove the decimal and trailing numbers? I'm having trouble doing that since this is a string with the % sign. — hockeyhorror, Oct 01 '20 at 13:45
I've updated my answer, let me know if it's what you wanted! — Cameron Riddell, Oct 01 '20 at 16:08

Pandas - how to add multiple conditional columns to dataframe?

1 Answers1