Let's say I have this dataframe:

import pandas as pd

df = pd.DataFrame([['a', 'b', 'c'],
                   ['1', '2', '3'],
                   ['4', '5', '6']],
                  index=['A', 'B', 'C'],
                  columns=['x', 'y', 'z'])

    x   y   z
A   a   b   c
B   1   2   3
C   4   5   6

I saw the code df.groupby('x')['y']. What does ['y'] do here? I understand the ('x') part.
Thanks in advance!

Mad Physicist
jayko03
  • It returns the column `y` – Nicolas Gervais Dec 15 '19 at 03:48
  • Would you like to look at [this answer](https://stackoverflow.com/a/53781645/8333806)? Thanks – abdoulsn Dec 15 '19 at 03:54
  • @NicolasGervais It returns `pandas.core.groupby.generic.SeriesGroupBy object`. – jayko03 Dec 15 '19 at 03:56
  • Here, `('x')` is used for DataFrameGroupBy, whereas `['y']` is used for SeriesGroupBy in pandas – Joy Dec 15 '19 at 03:57
  • `df.groupby('x')` groups on col `x`, while `df.groupby('x')['y']` makes a function operate on col `y` after grouping on `x`. E.g. `df.groupby('x')['y'].sum()` gives the sum of `y` after grouping on `x`, whereas `df.groupby('x').sum()` returns the sum of all columns (not only `y`) after grouping on `x`. – anky Dec 15 '19 at 04:32

2 Answers

The index of the result is made up of the groups you created with groupby(). The ['y'] selects the column y. To get a result, though, you also need to call a function on the grouped rows, like sum(). Here's an example:

import pandas as pd

df = pd.DataFrame({'Name':['Mark', 'Laura', 'Adam', 'Roger', 'Anna'],
                   'City':['Lisbon', 'Montreal', 'Lisbon', 'Berlin', 'Glasgow'],
                   'Height':[173.4, 151.8, 179.3, 169.1, 166.4]})
print(df)
    Name      City  Height
0   Mark    Lisbon   173.4
1  Laura  Montreal   151.8
2   Adam    Lisbon   179.3
3  Roger    Berlin   169.1
4   Anna   Glasgow   166.4

Return the sum of the heights, grouped by City:

df.groupby('City').sum()['Height']
Out[46]: 
City
Berlin      169.1
Glasgow     166.4
Lisbon      352.7
Montreal    151.8
Name: Height, dtype: float64

The new index is the group, and you selected one column to display. You can put the column selection either before or after sum(). Selecting before sum() means only that column gets aggregated.
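
As a rough sketch of the two orderings, using the same example frame: the snippet below checks what type `['Height']` produces on the grouped object, iterates over the groups without aggregating, and compares selecting the column before and after `sum()`. The `numeric_only=True` argument is my addition (an assumption, not part of the original code) so that the string column `Name` is left out of the sum regardless of pandas version.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mark', 'Laura', 'Adam', 'Roger', 'Anna'],
                   'City': ['Lisbon', 'Montreal', 'Lisbon', 'Berlin', 'Glasgow'],
                   'Height': [173.4, 151.8, 179.3, 169.1, 166.4]})

# Selecting a single column on a grouped DataFrame gives a SeriesGroupBy object.
grouped = df.groupby('City')['Height']
print(type(grouped).__name__)

# The grouped object can be iterated without calling any aggregation:
# each item is a (group key, Series of that group's heights) pair.
for city, heights in grouped:
    print(city, heights.tolist())

# Column selection before sum(): only Height is aggregated.
before = df.groupby('City')['Height'].sum()

# Column selection after sum(): numeric columns are aggregated, then one is picked.
# numeric_only=True is an assumption added here to skip the string column Name.
after = df.groupby('City').sum(numeric_only=True)['Height']

# Raises an AssertionError if the two results differ.
pd.testing.assert_series_equal(before, after)
```

Both orderings give the same Series here; selecting the column first simply avoids computing sums for columns you do not need.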

Nicolas Gervais
  • If you group values, you need to tell pandas how you want them grouped. By mean? By sum? Because that's what aggregation is – Nicolas Gervais Dec 15 '19 at 04:09
  • @MadPhysicist Have you read the Pandas docs? It sounds like you need a tutorial or guide, Stack Overflow isn’t really meant for this. Also, I think the code you shared is already doing that. – AMC Dec 15 '19 at 04:16
  • I'm not OP. Just a heckler from the sidelines. I'll take your advice regardless. – Mad Physicist Dec 15 '19 at 04:19
  • OK. I've looked through the docs. I see nothing that indicates that you have to aggregate after grouping. – Mad Physicist Dec 15 '19 at 04:22
  • From the docs: `DataFrameGroupBy` Returns: Depends on the __calling object__ and returns groupby object that contains information about the groups. – Nicolas Gervais Dec 15 '19 at 04:25
  • @MadPhysicist Oops, sorry for assuming you were OP! Indeed, I also assumed you could just get the column directly, no need to call anything. Unfortunately I’m not at my computer right now so I can’t check. – AMC Dec 15 '19 at 04:28
  • @NicolasGervais Would managing to print the output of a groupby like the one in the OP (without aggregation or a function) constitute enough proof? – AMC Dec 15 '19 at 04:41
  • `print(list(df.groupby('x')['y']))` where `df` is the one from the OP. (I see you’re in Montreal too, hi!) – AMC Dec 15 '19 at 04:44
  • @AlexanderCécile What is OP? – jayko03 Dec 15 '19 at 16:50
  • @jayko03 OP = original poster – AMC Dec 15 '19 at 18:54
groupby() creates groups of df, putting rows that share the same x value into the same group. Then, for each of these groups, you grab the y column and count how many values it contains. It's like value_counts() (a shortcut for this groupby() operation).
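
A small sketch of that comparison, assuming a `count()` call follows the column selection (the question's code does not show one). The frame below is hypothetical, made up so that `x` has repeated values; with the frame from the question every group would contain exactly one row.

```python
import pandas as pd

# Hypothetical data with repeated values in x (not the frame from the question).
df = pd.DataFrame({'x': ['a', 'b', 'a', 'a', 'b'],
                   'y': [1, 2, 3, 4, 5]})

# Group on x, select y, and count the non-null y values in each group.
counts = df.groupby('x')['y'].count()
print(counts.to_dict())  # {'a': 3, 'b': 2}

# value_counts() on x counts the rows per distinct x value.
# sort_index() is only there to put the result in the same order as groupby.
vc = df['x'].value_counts().sort_index()
print(vc.to_dict())  # {'a': 3, 'b': 2}
```

The two agree here because `y` has no missing values; `count()` skips nulls in `y`, while `value_counts()` on `x` never looks at `y`.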

abdoulsn