1

Data:

{"Survived":{"0":0,"1":1,"2":1,"3":1,"4":0,"5":0,"6":0,"7":0,"8":1,"9":1,"10":1,"11":1,"12":0,"13":0,"14":0,"15":1,"16":0,"17":1,"18":0,"19":1,"20":0,"21":1,"22":1,"23":1,"24":0,"25":1,"26":0,"27":0,"28":1,"29":0,"30":0,"31":1,"32":1,"33":0,"34":0,"35":0,"36":1,"37":0,"38":0,"39":1,"40":0,"41":0,"42":0,"43":1,"44":1,"45":0,"46":0,"47":1,"48":0,"49":0},"Pclass":{"0":3,"1":1,"2":3,"3":1,"4":3,"5":3,"6":1,"7":3,"8":3,"9":2,"10":3,"11":1,"12":3,"13":3,"14":3,"15":2,"16":3,"17":2,"18":3,"19":3,"20":2,"21":2,"22":3,"23":1,"24":3,"25":3,"26":3,"27":1,"28":3,"29":3,"30":1,"31":1,"32":3,"33":2,"34":1,"35":1,"36":3,"37":3,"38":3,"39":3,"40":3,"41":2,"42":3,"43":2,"44":3,"45":3,"46":3,"47":3,"48":3,"49":3},"Sex":{"0":"male","1":"female","2":"female","3":"female","4":"male","5":"male","6":"male","7":"male","8":"female","9":"female","10":"female","11":"female","12":"male","13":"male","14":"female","15":"female","16":"male","17":"male","18":"female","19":"female","20":"male","21":"male","22":"female","23":"male","24":"female","25":"female","26":"male","27":"male","28":"female","29":"male","30":"male","31":"female","32":"female","33":"male","34":"male","35":"male","36":"male","37":"male","38":"female","39":"female","40":"female","41":"female","42":"male","43":"female","44":"female","45":"male","46":"male","47":"female","48":"male","49":"female"},"Age":{"0":22.0,"1":38.0,"2":26.0,"3":35.0,"4":35.0,"5":28.0,"6":54.0,"7":2.0,"8":27.0,"9":14.0,"10":4.0,"11":58.0,"12":20.0,"13":39.0,"14":14.0,"15":55.0,"16":2.0,"17":28.0,"18":31.0,"19":28.0,"20":35.0,"21":34.0,"22":15.0,"23":28.0,"24":8.0,"25":38.0,"26":28.0,"27":19.0,"28":28.0,"29":28.0,"30":40.0,"31":28.0,"32":28.0,"33":66.0,"34":28.0,"35":42.0,"36":28.0,"37":21.0,"38":18.0,"39":14.0,"40":40.0,"41":27.0,"42":28.0,"43":3.0,"44":19.0,"45":28.0,"46":28.0,"47":28.0,"48":28.0,"49":18.0},"SibSp":{"0":1,"1":1,"2":0,"3":1,"4":0,"5":0,"6":0,"7":3,"8":0,"9":1,"10":1,"11":0,"12":0,"13":1,"14":0,"15":0,"16":4,"17":0,"18":1,"19":0,"20":0,"21":0,"22":0,"23":0,"24":3,"25":1,"26":0,"27":3,"28":0,"29":0,"30":0,"31":1,"32":0,"33":0,"34":1,"35":1,"36":0,"37":0,"38":2,"39":1,"40":1,"41":1,"42":0,"43":1,"44":0,"45":0,"46":1,"47":0,"48":2,"49":1},"Parch":{"0":0,"1":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":1,"8":2,"9":0,"10":1,"11":0,"12":0,"13":5,"14":0,"15":0,"16":1,"17":0,"18":0,"19":0,"20":0,"21":0,"22":0,"23":0,"24":1,"25":5,"26":0,"27":2,"28":0,"29":0,"30":0,"31":0,"32":0,"33":0,"34":0,"35":0,"36":0,"37":0,"38":0,"39":0,"40":0,"41":0,"42":0,"43":2,"44":0,"45":0,"46":0,"47":0,"48":0,"49":0},"Fare":{"0":7.25,"1":71.2833,"2":7.925,"3":53.1,"4":8.05,"5":8.4583,"6":51.8625,"7":21.075,"8":11.1333,"9":30.0708,"10":16.7,"11":26.55,"12":8.05,"13":31.275,"14":7.8542,"15":16.0,"16":29.125,"17":13.0,"18":18.0,"19":7.225,"20":26.0,"21":13.0,"22":8.0292,"23":35.5,"24":21.075,"25":31.3875,"26":7.225,"27":263.0,"28":7.8792,"29":7.8958,"30":27.7208,"31":146.5208,"32":7.75,"33":10.5,"34":82.1708,"35":52.0,"36":7.2292,"37":8.05,"38":18.0,"39":11.2417,"40":9.475,"41":21.0,"42":7.8958,"43":41.5792,"44":7.8792,"45":8.05,"46":15.5,"47":7.75,"48":21.6792,"49":17.8},"Embarked":{"0":"S","1":"C","2":"S","3":"S","4":"S","5":"Q","6":"S","7":"S","8":"S","9":"C","10":"S","11":"S","12":"S","13":"S","14":"S","15":"S","16":"Q","17":"S","18":"S","19":"C","20":"S","21":"S","22":"Q","23":"S","24":"S","25":"S","26":"C","27":"S","28":"Q","29":"S","30":"C","31":"C","32":"Q","33":"S","34":"C","35":"S","36":"C","37":"S","38":"S","39":"C","40":"S","41":"S","42":"C","43":"C","44":"Q","45":"S","46":"Q","47":"Q","48":"C","49":"S"},"Sex_Code":{"0":1,"1":0,"2":0,"3":0,"4":1,"5":1,"6":1,"7":1,"8":0,"9":0,"10":0,"11":0,"12":1,"13":1,"14":0,"15":0,"16":1,"17":1,"18":0,"19":0,"20":1,"21":1,"22":0,"23":1,"24":0,"25":0,"26":1,"27":1,"28":0,"29":1,"30":1,"31":0,"32":0,"33":1,"34":1,"35":1,"36":1,"37":1,"38":0,"39":0,"40":0,"41":0,"42":1,"43":0,"44":0,"45":1,"46":1,"47":0,"48":1,"49":0},"Embarked_Code":{"0":2,"1":0,"2":2,"3":2,"4":2,"5":1,"6":2,"7":2,"8":2,"9":0,"10":2,"11":2,"12":2,"13":2,"14":2,"15":2,"16":1,"17":2,"18":2,"19":0,"20":2,"21":2,"22":1,"23":2,"24":2,"25":2,"26":0,"27":2,"28":1,"29":2,"30":0,"31":0,"32":1,"33":2,"34":0,"35":2,"36":0,"37":2,"38":2,"39":0,"40":2,"41":2,"42":0,"43":0,"44":1,"45":2,"46":1,"47":1,"48":0,"49":2}}

I'm playing around with the Titanic data-set from Kaggle. I'm trying to find out what percentage within each Pclass of men and women survived.

Groupby example:

train_df.groupby(['Pclass','Sex','Survived']).apply(lambda x: len(x)).unstack(2).plot(kind='bar')

This shows me within each class how many men and women survived and how many did not, but it would visually be better to see what percentage of men and women survived within each class.

Desired Result:

train_df.groupby(['Pclass','Sex','Survived']).apply(lambda x: len(x)).unstack(2)[1]/(train_df.groupby(['Pclass','Sex','Survived']).apply(lambda x: len(x)).unstack(2)[1]+train_df.groupby(['Pclass','Sex','Survived']).apply(lambda x: len(x)).unstack(2)[0])

This looks like it gets the desired result, but I'm wondering if there is a much more pythonic way of doing this? like a normalize=True option would be slick.

End goal:

A bar chart of the ratio survived for each sex within each Pclass

Yale Newman
  • 1,141
  • 1
  • 13
  • 22

3 Answers3

2

You're on track with len and unstack. However, you can simplify this with count:

v = df.groupby(['Pclass','Sex','Survived']).Sex.count().unstack(2)

Now, find the mean.

v[1] / v.sum(1)

Pclass  Sex   
1       female    1.000000
        male      0.166667
2       female    0.750000
        male      0.500000
3       female    0.647059
        male      0.066667
dtype: float64

Plot with (v[1] / v.sum(1)).plot(kind='bar').

cs95
  • 379,657
  • 97
  • 704
  • 746
  • what's going on with the v.sum(1) bit – Yale Newman Sep 07 '18 at 06:20
  • It's basically a more winding way to take the mean, what piR did with one method call. My aim was to just show you how to get what you wanted with minimal change. – cs95 Sep 07 '18 at 06:25
  • I just don't know what is happening there. v[1] makes sense to me but what is happening with v.sum(1) does that mean sum the rows rather than the columns? – Yale Newman Sep 07 '18 at 06:28
  • @YaleNewman It's basically computing the percentage/ratio of survivors. `num_survivors / (num_survivors + num_not_survivors)` – cs95 Sep 07 '18 at 06:36
2

groupby

df.groupby(['Pclass', 'Sex']).Survived.mean().unstack()

Sex       female      male
Pclass                    
1       1.000000  0.166667
2       0.750000  0.500000
3       0.647059  0.066667

pivot_table

df.pivot_table('Survived', 'Pclass', 'Sex')

Sex       female      male
Pclass                    
1       1.000000  0.166667
2       0.750000  0.500000
3       0.647059  0.066667
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • 1
    wait, woah the average of 1 and 0 winds up being the same ratio. mind blown. i guess i need to review binomial distributions? what the – Yale Newman Sep 07 '18 at 06:23
  • train_df.pivot_table('Survived', 'Pclass', 'Sex').plot(kind='bar') this wound up being the best look. how could you get a similar look where each class is grouped tighter together. when i plot the result of the groupby it is much more difficult to interpret than plotting the result of the pivot_table – Yale Newman Sep 07 '18 at 06:35
  • Both options return the same dataframe. I get the same plot. – piRSquared Sep 07 '18 at 06:37
  • Apologies, was referring to the other solutions @piRSsquared. Would this also be possible with crosstab? Ultimately, I want a way to quickly see how different combinations of features correlate with a target variable – Yale Newman Sep 08 '18 at 03:48
2

You can use Groupby sum and size as:

g = df.groupby(['Pclass','Sex'])['Survived']
g.sum()/g.size()

Pclass  Sex   
1       female    1.000000
        male      0.166667
2       female    0.750000
        male      0.500000
3       female    0.647059
        male      0.066667
Name: Survived, dtype: float64

(g.sum()/g.size()).unstack()

Sex female  male
Pclass      
1   1.000000    0.166667
2   0.750000    0.500000
3   0.647059    0.066667
Space Impact
  • 13,085
  • 23
  • 48