
I have data that looks like this:

data = {"doc1": {'a': 2, 'b': 1, 'c': 3}, "doc2": {'a': 1, 'b': 1, 'c': 3}, "doc3": {'a': 1, 'b': 1, 'c': 3}}

I convert it into a DataFrame:

df = pd.DataFrame.from_dict(data, orient='index')

The DataFrame looks like this:

      a  c  b
doc1  2  3  1
doc2  1  3  1
doc3  1  3  1

Now I want to sum all the values in column b where the value in column a is 1.

So the result I want is 2.

Is there an easy way to do this rather than iterating through both columns? I checked other posts and found this, which makes use of .loc:

df.loc[df['a'] == 1, 'b'].sum()

But for some reason, I can't get it to work with my DataFrame.

Please let me know.

Thanks.

  • "But for some reason, I can't seem to make it to work with my dataframe." Could you give a detailed description of "some reason"? The answer you found should work. – seven7e Nov 30 '16 at 02:36
  • Hey, I got it to work with the answer below. But my code was doing the same thing; I was just mixing up the variables, which is why it was failing. Thanks a ton!! – skyfail Dec 02 '16 at 05:54

1 Answer


You are very close. See below.

>>> df[df['a'] == 1]['b'].sum()
2

Instead of using .loc, try filtering the DataFrame first (df[df['a'] == 1]), then selecting column 'b', and then summing.

Edit: I'll leave this here for future reference, although depending on the version of pandas you're using, your solution should work (thanks, @maxymoo). I'm running 0.18.1 and both approaches worked.
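For reference, here is a minimal, self-contained reproduction using the data from the question, showing that both the boolean-filter form and the .loc form give the same result (pandas is assumed to be installed):

```python
import pandas as pd

data = {"doc1": {'a': 2, 'b': 1, 'c': 3},
        "doc2": {'a': 1, 'b': 1, 'c': 3},
        "doc3": {'a': 1, 'b': 1, 'c': 3}}
df = pd.DataFrame.from_dict(data, orient='index')

# Boolean-mask filter, then select column 'b' and sum
print(df[df['a'] == 1]['b'].sum())      # 2

# Equivalent .loc form from the question
print(df.loc[df['a'] == 1, 'b'].sum())  # 2
```

The .loc form is generally preferred, since it does the row filter and column selection in one indexing step.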

blacksite
    actually this should be equivalent to the OP's code ... although maybe it's only in the latest version of pandas ? – maxymoo Nov 30 '16 at 02:42
  • Hmmm. You're right. I'm running `pandas 0.18.1`, and his solution still worked. – blacksite Nov 30 '16 at 02:43
  • Hey, thanks for the quick reply. I tried the code and it works fine, but my dataframe is pretty huge, with 10k columns and over 2k rows, and I need to calculate the sum many times. With the above code, it took a lot of time. Is there a faster way to do this? – skyfail Nov 30 '16 at 05:57
  • That's almost a different problem. How many different values are you trying to filter on? How many columns are you trying to sum? The above example should run very quickly with only 2k rows and one column filter. It sounds like a groupby method is what you're looking for. – blacksite Nov 30 '16 at 12:37
  • I am doing the exact same thing that I mentioned above, but on a larger scale: going through 2k rows and iterating through all the columns (around 30k), but only considering one column at a time with respect to another column. – skyfail Nov 30 '16 at 16:33
  • What's the purpose of your analysis? From your `doc1`, `doc2`, etc. labels, it seems like you're doing some sort of text analysis using word frequencies. If so, might I suggest `sklearn.feature_extraction.text.CountVectorizer`? Since that returns a numpy array, you'll be able to perform these sums *much* faster. – blacksite Nov 30 '16 at 18:00
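The groupby approach mentioned in the comments can be sketched as follows on the question's small example. This is a sketch of the general idea, not the commenters' exact code: a single `groupby('a').sum()` computes the conditional sum of every other column for every distinct value of 'a' at once, so repeated per-value filtering is avoided.

```python
import pandas as pd

data = {"doc1": {'a': 2, 'b': 1, 'c': 3},
        "doc2": {'a': 1, 'b': 1, 'c': 3},
        "doc3": {'a': 1, 'b': 1, 'c': 3}}
df = pd.DataFrame.from_dict(data, orient='index')

# One groupby produces, for each value of 'a', the sum of every
# other column -- no per-value boolean filtering in a Python loop.
sums = df.groupby('a').sum()
print(sums.loc[1, 'b'])  # 2 -- sum of 'b' over rows where 'a' == 1
```

The resulting `sums` frame is indexed by the distinct values of 'a', so any conditional sum is then a cheap lookup rather than a fresh scan of the DataFrame.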