How to groupby one or two columns and sum 100 columns

Question

*updated original question

Example code:

import pandas as pd
df = pd.DataFrame({'Weight': [1.2, 2.0, 1.8,2.4,1.9,2.3], 
                   'Sex': ['Male', 'Female', 'Unknown','Male','Male','Female'],
                  'Neutered': ['Entire', 'Unknown', 'Neutered','Neutered','Neutered','Unknown'],
                  'Rabbit_Breed': ['Dutch', 'Lop', 'Dwarf','Giant','Cross-Breed','Dwarf'],
                  'Abscess-mouth': [0, 0, 1,0,0,0],
                  'Overweight': [0, 1, 0,1,0,1],
                  'underweight': [0, 0, 1,0,0,1],
                  'molars-long': [1, 0, 1,0,0,1]})


df.head()

NB: I have around 100 columns so I cannot list them all; I'm looking for a way to groupby and or sum through all the columns to have the most common disorders in relation to the breed or sex of a rabbit.

I've attached an image of my thought process:

original question: I'm looking to groupby one or two columns and sum all the other columns. Not sure if I should use a range or what but I keep getting errors.

Unless I've misunderstood the purpose of groupby and sum. I've got about 100 columns of disorders in domestic rabbits and ultimately I'm trying to investigate the most common ones and plot them against breed or female/male etc.

Thank you!!

Please don’t post images of the data as we can’t test them. Instead, create a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example), sharing a sample of the DataFrame(s) and the expected output inside a code block together with the description of the problem. This allows us to easily reproduce your problem and help you. These should help: [how to ask a good question](https://stackoverflow.com/help/how-to-ask) and [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). — Rodalm, May 29 '22 at 16:48
Unclear what you are looking for but `sum()` will sum all the values in a column for every group. Maybe try using `count()`, it counts all the values that are not None in a column for every group. — Ilya, May 29 '22 at 16:51
I've added the code. Thanks for the links. I'm new to posting on stack overflow :) — Fernandes92, May 29 '22 at 17:11
@LauraFernandes You're welcome. Just so you know, you have to tag people with `@` in the comments for them to be notified. That's better, but it's not clear what the expected output should be in your example. And it is better if you include in your example more than one record per group, so it's clear that you want to calculate things per group. — Rodalm, May 29 '22 at 17:43
@Rodalm Thanks for the tip. I chose a dataset that I found online because I was interested in diet related illnesses in pet rabbits: https://rvc-repository.worktribe.com/output/1382109/morbidity-and-mortality-of-domestic-rabbits-oryctolagus-cuniculus-under-primary-veterinary-care-in-england-in-2013 .The dataset has many illnesses and it's more of a tally. — Fernandes92, May 29 '22 at 17:54
@Rodalm ^^ continuing from above. Looking at the data I can see for instance, one of the disorders ( overgrown/ long-claws ) totals 56. I feel like I'm working with an awkward dataset and have tried many different things like transpose or appending a column of row sums but it got complicated when it came to plotting as I had to keep using pd.numeric errors=coerce. After spending the whole weekend working on it, I'm just looking for alternative ways of plotting the common illnesses in pet rabbits. I thought an image would provide more insight as I can't code it all. Thanks — Fernandes92, May 29 '22 at 17:54
Well, I will insist again. Please create a minimal reproducible example (see the links above), with a sample of the DataFrame and the output you expect to get, along with a description of the problem and the logic behind the output, so that it is totally clear. Edit your post with that new info, otherwise, I can't help you. Trying to explain the desired output in words alone gets confusing and may lead to misunderstandings, and you end up wasting other people's time. Don't assume that other people can figure out what's on your mind based only on a brief and unclear explanation of the problem. — Rodalm, May 29 '22 at 18:23
@Rodalm Thanks, I appreciate your input. I'm trying my best to explain my question. — Fernandes92, May 29 '22 at 18:46
It is clear now, thanks! Is my solution what you were looking for? — Rodalm, May 29 '22 at 19:20

score 1 · Answer 1 · answered May 29 '22 at 19:18

1

Plotting a histogram doesn't make much sense to me, since you want to plot bivariate data (disorder vs. breed), while histograms are meant for univariate data. I think you want a heatmap, which is basically the generalization of a histogram for two dimensions. For that, you can use seaborn.heatmap.

Is this what you want?

import pandas 
import seaborn as sns

df = pd.DataFrame({'Weight': [1.2, 2.0, 1.8,2.4,1.9,2.3], 
                   'Sex': ['Male', 'Female', 'Unknown','Male','Male','Female'],
                  'Neutered': ['Entire', 'Unknown', 'Neutered','Neutered','Neutered','Unknown'],
                  'Rabbit_Breed': ['Dutch', 'Lop', 'Dwarf','Giant','Cross-Breed','Dwarf'],
                  'Abscess-mouth': [0, 0, 1,0,0,0],
                  'Overweight': [0, 1, 0,1,0,1],
                  'underweight': [0, 0, 1,0,0,1],
                  'molars-long': [1, 0, 1,0,0,1]})

disorder_cols = ['Abscess-mouth', 'Overweight', 'underweight', 'molars-long']

disorder_by_breed = df.groupby('Rabbit_Breed')[disorder_cols].sum()

sns.heatmap(data=disorder_by_breed, annot=True, lw=1, cmap='Reds')

Output:

answered May 29 '22 at 19:18

Rodalm

5,169
5
21

1

Thank you so much! Yes any kind of plot where I can show the relationship and draw conclusions from is helpful. Thanks again! :) – Fernandes92 May 29 '22 at 20:48
No worries, It was complicated, but I'm glad I was able to help in the end ;) If you find the answer helpful, please consider [marking it as accepted](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work). – Rodalm May 29 '22 at 21:29
thank you, to build on your example where you have disorder_cols = ['Abscess-mouth', 'Overweight', 'underweight', 'molars-long'] do you know how I can select an index of columns? I couldn't list them all as there's over 100 but if I want around 30 columns how would I do that please? – Fernandes92 May 30 '22 at 12:05
@Fernandes92 You can select a range of columns by index using `disorder_cols=df.columns[start_col_idx:end_col_idx]`, similarly as you slice lists – Rodalm Jun 01 '22 at 17:56

How to groupby one or two columns and sum 100 columns

1 Answers1

Linked