1

So I've got a simple summary dataframe of sales by gender that goes:

Gender |    Sales
___________________
M      |    25
F      |    30

All I want to do now is to return a line in Python that reads: The mean gap in the amount sold is 16.67%

This is simply 30 - 25 divided by 30, and multiplied by 100; and I want a % sign at the end.

I have tried:

m_sales = df.loc[df['Gender'] == 'M']
f_sales = df.loc[df['Gender'] == 'F']

print('The mean gap in the amount sold is:', m_sales['Sales'] - f_sales['Sales'] / m_sales['Sales'] * 100, '%')

Unfortunately this does not work. I get:

The mean gap in the amount sold is: 0 NaN 1 NaN Name: Sales, dtype: object %

Thoughts please? I am very much a beginner so sorry for such a basic query!

smci
  • 32,567
  • 20
  • 113
  • 146
  • 1
    Seems like an straight-forward `loc` usage. Have you tried it? – rafaelc Oct 09 '20 at 16:20
  • 1
    This is very easy. What things have your tried, and where did you get stuck? You have to try your own code and post it. It's not allowed on SO to just post a spec and ask other people to write code for you. – smci Oct 09 '20 at 16:21
  • Does this answer your question? [pandas get column average/mean](https://stackoverflow.com/questions/31037298/pandas-get-column-average-mean) – ayrusme Oct 09 '20 at 16:22
  • @smci - apologies didn't mean to just leave it as-is, was in a hurry and forgot to write my own code. Have updated my query. – SQLGIT_GeekInTraining Oct 09 '20 at 16:36
  • @thatrockbottomprogrammer - No - I have already used `groupby()` and `mean()` on my previous dataset to calculate the mean sales, where I'm getting stuck is just printing out my result as I've explained above. – SQLGIT_GeekInTraining Oct 09 '20 at 16:39
  • Since you essentially have a single column (Sales) indexed by Gender, rather than a full dataframe, it's cleanest to read in with `read_csv(..., index_col=[0], squeeze=True)`. Then you can directly use `.loc['M']` like you want to. – smci Oct 09 '20 at 17:22
  • You can simplify your formula to `(1 - sl.loc['M']/sl.loc['F'])`. And in fact if it's given that there are exactly two Genders and you're only interested in the relative diff, without knowing which gender sold more, you could use `1 - float(sl.diff().tail(1) / sl.max())`. – smci Oct 09 '20 at 17:25
  • If you want to take things one step further and reduce to `sl.to_dict()`, now you can directly reference sl['M']/sl['F']. This is the clearest and most compact, but it loses the pandas type and methods, also hardcodes the genders. – smci Oct 09 '20 at 18:13
  • Also, [f-strings](https://docs.python.org/3/reference/lexical_analysis.html#f-strings) even have a datatype for percent: `:.2%`, so we don't need the `100. * (...)` boilerplate either. See my answer. – smci Oct 09 '20 at 18:22

2 Answers2

0

Append ["Sales"].iloc[0] to your filter expressions to obtain directly values for M and F, and then project these changes to the print() function, too:

m_sales = df.loc[df['Gender'] == 'M']["Sales"].iloc[0]
f_sales = df.loc[df['Gender'] == 'F']["Sales"].iloc[0]

print('The mean gap in the amount sold is:', (f_sales - m_sales) / f_sales * 100, '%')
The mean gap in the amount sold is: 16.666666666666664 %

The explanation:

  • df.loc[df['Gender'] == 'M'] is a dataframe;
  • selecting the "Sales" column by appending ["Sales"] you obtain a series (with only 1 element), and
  • by appending .iloc[0] you obtain the first (=the only one) element of that series.

Note:

You may use f-string (for Python 3.6+) or .format() method for adjusting your output, e.g.

print(f'The mean gap in the amount sold is: {(f_sales - m_sales) / f_sales * 100:.2f}%')
The mean gap in the amount sold is: 16.67%
MarianD
  • 13,096
  • 12
  • 42
  • 54
  • Makes perfect sense - thank you, I can see what I was missing. However, I get the following error: --------------------------------------------------------------------------- TypeError Traceback (most recent call last) in 2 f_sales = df.loc[df['Gender'] == 'F']["Sales"].iloc[0] 3 ----> 4 print('The mean gap in the amount sold is:', (m_sales - f_sales) / m_sales * 100, '%') TypeError: unsupported operand type(s) for -: 'str' and 'str' – SQLGIT_GeekInTraining Oct 09 '20 at 17:13
  • I tested it and it worked OK. You probably have string values in your column "Sales" (try `df.dtypes` to see data types of your columns). Use `int()` or `float()` functions to convert the selected values for `m_sales` and `f_sales` into numbers: `m_sales = int(m_sales)`; `f_sales = int(f_sales)` before the `print()` function. – MarianD Oct 09 '20 at 17:26
  • Yes - I think when I was testing it in the sandbox I had generated a fresh dataframe but forgot to check for data types. Works like a charm - special thanks to you for such a clean and crisp explanation -really helps a beginner like me! – SQLGIT_GeekInTraining Oct 09 '20 at 17:27
0

Ok, you want to be able to index your sales directly by gender (using .loc[]), so for that we read your dataframe with index_col=[0] to set index to Gender column, then squeeze=True to reduce the remaining 1-column dataframe into a Series.

Then I use an f-string for formatting. Note we can inline your expression inside the f-string:

import pandas as pd
from io import StringIO    

dat = """\
Gender |    Sales
___________________
M      |    25
F      |    30
"""

sl = pd.read_csv(StringIO(dat), sep='\s*\|\s*', skiprows=[1], index_col=[0],
    engine='python', squeeze=True)

#               Sales
# Gender            
# M               25
# F               30

print(f"The mean gap in the amount sold is: {100.*(1 - sl.loc['M']/sl.loc['F']):.2f}%")
# The mean gap in the amount sold is: 16.67%

# ...but f-strings even have a datatype for percent: `:.2%`, so we don't need the `100. * (...)` boilerplate.
print(f"The mean gap in the amount sold is: {(1 - sl.loc['M']/sl.loc['F']):.2%}")
The mean gap in the amount sold is: 16.67%

...and if you want to take things one step further and reduce the df -> Series -> dict, do sl.to_dict(), now you can directly reference sl['M']/sl['F'] like you might want to (obviously we lose all the rich methods of a Series.)

smci
  • 32,567
  • 20
  • 113
  • 146
  • This is very interesting - hadn't considered something like this.. will revisit this as it's going to take me a little while to digest :) – SQLGIT_GeekInTraining Oct 09 '20 at 17:28
  • Sure. In general try to aggressively reduce the dimensionality and eliminate any unnecessary columns. NA removal, filtering, aggregating and setting the index usually do the trick. – smci Oct 09 '20 at 17:30
  • If you want to take things one step further and reduce the df -> Series -> dict, do `sl.to_dict()`, now you can directly reference `sl['M']/sl['F']` like you might want to. (obviously we lose all the rich methods of a Series.) – smci Oct 09 '20 at 18:09