
I have a dataframe that looks like this:

D   X   Y   Z
A  22  16  23
A  21  16  22
A  20  17  21
B  33  50  11
B  34  53  12
B  34  55  13
C  44  34  11
C  45  33  11
C  45  33  10
D  55  35  60
D  57  34  61
E  66  36  13
E  67  38  14 
E  67  37  13

I want to get the minimum and maximum values of each column, grouped by the categorical variable D, so the output dataframe should look something like this:

D Xmin Xmax Ymin Ymax Zmin Zmax
A 20    22   16   17   21   23 
B 33    34   50   55   11   13    
C 44    45   33   34   10   11  
D 55    57   34   35   60   61
E 66    67   36   38   13   14

I have tried this, but no luck:

min_max_df = dfObj.groupby('D').agg({'X': [dfObj.min(axis=0), dfObj.max(axis=0)]})
– IronMaiden

4 Answers

df = df.groupby('D').agg(['min', 'max'])

Output:

>>> df
    X       Y       Z    
  min max min max min max
D                        
A  20  22  16  17  21  23
B  33  34  50  55  11  13
C  44  45  33  34  10  11
D  55  57  34  35  60  61
E  66  67  36  38  13  14

>>> df['X']['min']
D
A    20
B    33
C    44
D    55
E    66
Name: min, dtype: int64
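
You can also select the same Series with a single tuple lookup on the MultiIndex columns, instead of chaining two selections:

>>> df[('X', 'min')]  # same values as df['X']['min'] above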

You can flatten the columns as well:

df.columns = df.columns.map(''.join)
df.rename_axis(None)

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
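
Note that rename_axis(None) returns a new DataFrame by default rather than modifying df in place, so assign the result back if you want to keep the unnamed index:

df = df.rename_axis(None)  # rename_axis returns a copy unless inplace=True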
– sammywemmy
from itertools import product
aggs = {f"{col}{fn}": (col, fn) for col,fn in product(['X', 'Y', 'Z'], ['min', 'max'])}
df.groupby('D').agg(**aggs)
>>>
   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
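
For reference, the dict comprehension just builds pandas named aggregations; for these three columns it expands to the equivalent of writing them out by hand:

df.groupby('D').agg(
    Xmin=('X', 'min'), Xmax=('X', 'max'),
    Ymin=('Y', 'min'), Ymax=('Y', 'max'),
    Zmin=('Z', 'min'), Zmax=('Z', 'max'),
)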
– Asish M.
  • This is the fastest, per HarryPlotter's [benchmarks](https://stackoverflow.com/a/69950622/17242583). –  Nov 13 '21 at 15:05
  • @user17242583: Not per cycle. (Though it is not clear why the two kinds of numbers would not be consistent - due to some fixed overhead?) – Peter Mortensen Nov 14 '21 at 07:36
  • Yeah actually, for an individual run it looks like mine (and Harry's) was faster. –  Nov 14 '21 at 15:09

I believe this is a nice way of doing it in a single line of code. It makes use of join to combine the two aggregations by index, with lsuffix and rsuffix to differentiate min and max.

output = df.groupby('D').min().join(df.groupby('D').max(), lsuffix='min', rsuffix='max')

Outputs:

   Xmin  Ymin  Zmin  Xmax  Ymax  Zmax
D
A    20    16    21    22    17    23
B    33    50    11    34    55    13
C    44    33    10    45    34    11
D    55    34    60    57    35    61
E    66    36    13    67    38    14
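
Since join places all the min columns before the max columns, you can reorder them afterwards if you want each column's min and max next to each other, as in the question's expected output (an extra step on top of the one-liner):

output = output[['Xmin', 'Xmax', 'Ymin', 'Ymax', 'Zmin', 'Zmax']]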
– Celius Stingher
df = df.groupby('D').agg(['min', 'max'])
df.columns = df.columns.map("".join)  # flatten multi-level columns

Output

>>> df

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D                                    
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
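
If you prefer a separator in the flattened names (e.g. X_min instead of Xmin), join the levels with an underscore instead:

df.columns = df.columns.map('_'.join)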

Speed Test

I did some simple tests comparing Celius's, Asish's and user17242583's/my solutions.

Setup

import numpy as np 
import pandas as pd 
from itertools import product

n = 1_000_000
rng = np.random.default_rng() 

df = pd.DataFrame({
    'D': rng.choice(list("ABCDEFGH"), size=n),
    'X': rng.integers(100, size=n),
    'Y': rng.integers(100, size=n),
    'Z': rng.integers(100, size=n),
})

>>> df.shape
(1000000, 4)

Results

Here are the results.

# Asish's
>>> %%timeit -n 50
... aggs = {f"{col}{fn}": (col, fn) for col,fn in product(['X', 'Y', 'Z'], ['min', 'max'])}
... df1 = df.groupby('D').agg(**aggs) 

116 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

# user17242583's / mine 
>>> %%timeit -n 50
... df1 = df.groupby('D').agg(['min', 'max'])
... df1.columns = df1.columns.map("".join)  # flatten multi-level columns

120 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

# Celius's
>>> %%timeit -n 50
... df1 = df.groupby('D').min().join(df.groupby('D').max(), lsuffix='min', rsuffix='max')

178 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

Comparing the first two with the last one, we can conclude that using DataFrameGroupBy.agg is faster than the join approach for large DataFrames (roughly 120 ms vs. 178 ms here).

– Rodalm
  • @user17242583 Yes, but only for >~1M rows. For ~100k rows, the timings are practically identical ;) Lower than that, Celius's answer seems faster! – Rodalm Nov 13 '21 at 01:47
  • I wouldn't have expected my answer to be the fastest in any scenario (a join operation with two groupings inside it); however, for <100k rows, time is usually deprioritized over readability and applicability for a newer user. As long as bad habits are not taught (like using iterrows when you can have a vectorized solution), I try to aim for a one-liner that does the trick. – Celius Stingher Nov 13 '21 at 01:49
  • @CeliusStingher Yeah, I understood, and I totally agree with you! I was just curious how big the difference between using `join` and `agg` was. – Rodalm Nov 13 '21 at 01:51
  • Re *"I did some simple tests"*: On what system (hardware (CPU type, clock frequency, number of cores, etc.), operating system), incl. version? Using what software and which versions/compiler flags (e.g., hardware acceleration)/compiler configuration (e.g. for Python and Pandas)? Under what conditions, e.g., cold (first time run) / warm. Please respond by [editing (changing) your answer](https://stackoverflow.com/posts/69950622/edit), not here in comments (***without*** "Edit:", "Update:", or similar - the answer should appear as if it was written today). – Peter Mortensen Nov 14 '21 at 07:24
  • Related: *[What makes a "good" performance question on Stack Overflow?](https://meta.stackoverflow.com/questions/412875/what-makes-a-good-performance-question-on-so/412904#412904)* – Peter Mortensen Nov 14 '21 at 07:33
  • I think the answer is good enough for general knowledge. I am curious about something: while specifying all the details you mention might help create a more objective measurement, there is probably no way we can match the OP's specs, and most users have different specs too. Therefore all our measurements will differ from the OP's, but the most efficient solution vs. the least efficient one should hold across most specs. – Celius Stingher Nov 14 '21 at 13:08
  • @CeliusStingher Thank you, I completely agree. In many years of reading SO posts, I have never seen someone provide those details in a simple performance test. The average user isn't interested in them. The setup is given precisely so everyone (especially the OP) can easily compare the timings on their own machine if they want. It shouldn't be seen as a rigorous benchmark, as many more tests would be needed. The point was just to show that for large enough data, using `agg` instead of `join` is faster due to their implementation, which is something we're all aware of. – Rodalm Nov 14 '21 at 15:02