import seaborn as sns

diamonds = sns.load_dataset("diamonds")
diamonds.head()

ideal_good = diamonds[(diamonds["cut"]=="Ideal") | 
                      (diamonds["cut"]=="Good")]
ideal_good.groupby("cut")["price"].mean()
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
cut
Ideal        3457.541970
Premium              NaN
Very Good            NaN
Good         3928.864452
Fair                 NaN
Name: price, dtype: float64

Why am I seeing Premium, Very Good and Fair even though I filtered them out? How do I remove those categories from the output?

  • [How to create a Minimal, Reproducible Question](https://stackoverflow.com/help/minimal-reproducible-example) You should not just request a ready solution. SO is for helping solve specific errors, after you've shown your effort solving them – Ron Jan 17 '23 at 00:35
  • Not sure I am following. I created this code and I am not able to move beyond it as an R user. In R, this is very easy to do: diamonds %>% filter(cut == "Ideal"|cut=="Good") %>% distinct(cut) – Rizzle Jan 17 '23 at 00:40
  • Can you provide a small sample of the diamonds dataframe so we can see what is in it? – Galo do Leste Jan 17 '23 at 00:42
  • @GalodoLeste Done. I don't want code, I just want to know why this is happening. R does not have these elementary issues. – Rizzle Jan 17 '23 at 00:47
  • I get that, but generally the problem is that the code is incorrect. I tried your code and it seems to work fine for me. Can you show the print statement you used to get the output? – Galo do Leste Jan 17 '23 at 01:03
  • Hi @GalodoLeste This was what i used to get the output: ideal_good.groupby("cut")["price"].mean() – Rizzle Jan 17 '23 at 01:06
  • No. That is what you used to calculate some averages. In order to print out the output which contains the NaN values, you must have used some print statement like ```print(ideal_good)```. The output you have there looks like you used the statement ```print(diamonds.groupby("cut")["price"].mean())``` – Galo do Leste Jan 17 '23 at 01:09
  • @GalodoLeste OP seems to be using Jupyter, which is a REPL (IPython). Explicitly printing is not necessary in a REPL. – wjandrea Jan 17 '23 at 01:11
  • Just saved it as an object but the output as I suspected would be the same. So @wjandrea is correct. – Rizzle Jan 17 '23 at 01:13
  • Ok, my apologies. However, my statement still stands. The output shown looks like it is ```diamonds.groupby("cut")["price"].mean()``` not ```ideal_good.groupby("cut")["price"].mean()``` – Galo do Leste Jan 17 '23 at 01:14
  • It looks like it's because `diamonds["cut"]` is [categorical](https://pandas.pydata.org/docs/user_guide/categorical.html). I don't know categoricals very well myself, but FWIW, you can get the output you want by doing `...mean()[["Ideal", "Good"]]`, but that seems clunky. – wjandrea Jan 17 '23 at 01:14
  • Beside the point, but you could shorten the selection: `diamonds[diamonds["cut"].isin(['Ideal', 'Good'])]` – wjandrea Jan 17 '23 at 01:16
  • Alternatively, you could call `dropna()` on the output after calculating `mean()` – Galo do Leste Jan 17 '23 at 01:18
  • Thank you for the attempts. I figured it out. I could have added a condition in the group by: `ideal_good.groupby("cut", observed=True)["price"].mean()` – Rizzle Jan 17 '23 at 01:24
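
To tie the comments above together, here is a minimal sketch of why the extra rows appear and of the fixes suggested in the thread. It assumes a small hand-built frame (the `cut` and `price` values of the five rows shown in the question) as a stand-in for the seaborn dataset, so it runs without downloading anything. The names `cut_levels`, `means_observed`, `means_trimmed` and `means_dropped` are only illustrative. Filtering rows of a categorical column keeps all of its declared categories, and with `observed=False` (the default in the pandas versions current when the question was asked) `groupby` reports every category, including empty ones.

```python
import pandas as pd

# Illustrative stand-in for sns.load_dataset("diamonds"): only the "cut" and
# "price" columns of the five rows printed in the question.
cut_levels = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
diamonds = pd.DataFrame({
    "cut": pd.Categorical(
        ["Ideal", "Premium", "Good", "Premium", "Good"],
        categories=cut_levels,
    ),
    "price": [326, 326, 327, 334, 335],
})

# Shorter row selection, as suggested in the comments.
ideal_good = diamonds[diamonds["cut"].isin(["Ideal", "Good"])]

# The rows are gone, but the column still declares all five categories.
print(list(ideal_good["cut"].cat.categories))

# Fix 1: only report categories that actually occur in the data.
means_observed = ideal_good.groupby("cut", observed=True)["price"].mean()

# Fix 2: drop the unused categories from the column before grouping.
trimmed = ideal_good.assign(cut=ideal_good["cut"].cat.remove_unused_categories())
means_trimmed = trimmed.groupby("cut", observed=False)["price"].mean()

# Fix 3: keep every category in the groupby, then drop the empty (NaN) groups.
means_dropped = ideal_good.groupby("cut", observed=False)["price"].mean().dropna()

print(means_observed)
```

All three approaches leave only the Ideal and Good rows in the result. `observed=True` is the least invasive, while `remove_unused_categories()` is probably the better fit if the filtered frame will be grouped or plotted several times.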

0 Answers