
I have a 3D dataframe with x and y as spatial dimensions and time as the third dimension. The data are indices from 5 satellite images that were taken at different dates, and x and y describe every pixel.

 x        y              time       SIPI       classif
7.620001 -77.849990     2018-04-07  1.011107    2.0
                        2018-10-14  1.023407    2.0
                        2018-12-28  0.045107    3.0
                        2020-01-10  0.351107    2.0
                        2018-06-29  0.351107    2.0
         -77.849899     2018-04-07  1.010777    8.0
                        2018-10-14  0.510562    2.0
                        2018-12-28  1.410766    4.0
                        2020-01-10  1.010666    8.0
                        2018-06-29  2.057068    8.0
         -77.849809     2018-04-07  0.986991    1.0
                        2018-10-14  0.986991    8.0
                        2018-12-28  0.986991    5.0
                        2020-01-10  0.984791    5.0
                        2018-06-29  0.986991    3.0
         -77.849718     2018-04-07  0.975965    10.0
                        2018-10-14  0.964765    7.0
                        2018-12-28  0.975965    10.0
                        2020-01-10  0.975965    10.0
                        2018-06-29  0.975965    3.0
         -77.849627     2018-04-07  1.957747    2.0
                        2018-10-14  0.132445    6.0
                        2018-12-28  0.589677    2.0
                        2020-01-10  1.982445    2.0
                        2018-06-29  3.334456    7.0
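
In case it helps to reproduce this, the first two pixels of the frame above can be rebuilt roughly like this (a sketch; the values are copied from the printout):

import pandas as pd

dates = ["2018-04-07", "2018-10-14", "2018-12-28", "2020-01-10", "2018-06-29"]
df = pd.DataFrame({
    "x": [7.620001] * 10,
    "y": [-77.849990] * 5 + [-77.849899] * 5,
    "time": pd.to_datetime(dates * 2),
    "SIPI": [1.011107, 1.023407, 0.045107, 0.351107, 0.351107,
             1.010777, 0.510562, 1.410766, 1.010666, 2.057068],
    "classif": [2.0, 2.0, 3.0, 2.0, 2.0,
                8.0, 2.0, 4.0, 8.0, 8.0],
}).set_index(["x", "y", "time"])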

I need to group the data, and as a new column I need the value from the column 'classif' that is most frequent across the 5 dates. The values are integers between 1 and 10. I also want to add a condition that only keeps a value whose frequency is at least 3.

 x          y           classif
7.620001 -77.849990     2.0
         -77.849899     8.0
         -77.849809     NaN
         -77.849718     10.0
         -77.849627     2.0

So as a result I need a dataframe where each pixel gets the value with the highest frequency, and where that frequency is lower than 3 there should be a NaN value.

Can the pandas groupby function do that? I thought about value_counts(), but I'm not sure how to apply it to my dataset.
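
This is roughly the direction I was thinking of with value_counts() (just a sketch, probably not the right way):

# count how often each class occurs per pixel; value_counts() sorts by frequency,
# so the first entry of every group would be the most frequent class
counts = df.groupby(['x', 'y'])['classif'].value_counts()

but I don't know how to get from these counts to one value per pixel with the "at least 3" condition.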

Thank you in advance!

inerini9
  • Does this answer your question? [How to select rows from a DataFrame based on column values?](https://stackoverflow.com/questions/17071871/how-to-select-rows-from-a-dataframe-based-on-column-values) – Marios Aug 07 '20 at 12:23
  • This is without the frequency greater than three condition: `df.groupby(["x", "y"])["classif"].agg(pd.Series.mode)` – Dan Aug 07 '20 at 12:28
  • I tried something like that, but how does the function choose the value for a pixel if the column 'classif' contains 5 different values? Does the function take the first best value? – inerini9 Aug 07 '20 at 12:38
  • What do you mean "the first best value"? It takes the [mode](https://en.wikipedia.org/wiki/Mode_(statistics)). Are you asking what it does in the event of a tie for the most frequent number in a group? – Dan Aug 07 '20 at 12:43
  • Sorry for being unclear. I mean the first value it comes across. I know that it takes the mode. Exactly, I was not sure what the function does if there is a tie, like 1, 2, 3, 4, 7. – inerini9 Aug 07 '20 at 13:00
  • What do you want it to do? You haven't specified. In my answer, the `[0]` part means to just give you the first one it finds, I think (see the small example after these comments). If you want something more specific, like the smallest mode, you can change it to `pd.Series.mode(x).min()` – Dan Aug 07 '20 at 13:09
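
A small illustration of the tie case discussed in the comments above (a sketch with made-up values, not from the real data):

import pandas as pd

s = pd.Series([1, 2, 3, 4, 7])   # every value occurs once, so all five are tied
print(pd.Series.mode(s))         # returns all tied values in ascending order: 1, 2, 3, 4, 7
print(pd.Series.mode(s)[0])      # [0] therefore picks the smallest tied value -> 1
print(pd.Series.mode(s).min())   # equivalent here: explicitly take the smallest mode -> 1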

1 Answer


Here is a clunky way to do it:

import numpy as np
import pandas as pd

# Get the mode per group and count how often it occurs
df_modes = df.groupby(["x", "y"]).agg(
    {
        'classif': [lambda x: pd.Series.mode(x)[0],
                    lambda x: sum(x == pd.Series.mode(x)[0])]
    }
).reset_index()
# Rename the columns to something a bit more readable
df_modes.columns = ["x", "y", "classif_mode", "classif_mode_freq"]
# Discard modes whose frequency was less than 3
df_modes.loc[df_modes["classif_mode_freq"] < 3, "classif_mode"] = np.nan

Now `df_modes.drop("classif_mode_freq", axis=1)` will return

          x          y  classif_mode
0  7.620001 -77.849990           2.0
1  7.620001 -77.849899           8.0
2  7.620001 -77.849809           NaN
3  7.620001 -77.849718          10.0
4  7.620001 -77.849627           2.0
Dan
  • Is there a possibility to do it without reset_index? Because of that I lose the dimensions. My actual dataframe looks like this: https://i.stack.imgur.com/A92v3.png and I need something like this: https://i.stack.imgur.com/7cPWA.png. Now I tried this function: `count_cond = df.groupby(['x','y']).agg(lambda x: x.value_counts().index[0])`, but I cannot apply my condition there. – inerini9 Aug 07 '20 at 13:23
  • I'm not sure - but you can always set lat and lon to be the index again after you're done, for example something like the sketch below. – Dan Aug 07 '20 at 14:53
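
A minimal sketch of that last suggestion, assuming the df_modes frame from the answer above (x and y are the lat/lon columns):

# drop the helper frequency column and move x and y back into the index,
# so the result has the same (x, y) MultiIndex as the original grouping
result = df_modes.drop("classif_mode_freq", axis=1).set_index(["x", "y"])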