query using group by in Python and Excel

Question

Hey just wondering if I was provided the following table

   col_a  col_b
0  A      1
1  B      2
2  B      3
3  C      4
4  D      5
5  D      6
6  E      3
7  A      2
8  A      1

how I can get the following result using Python and Excel? Much appreciated

   col_a  min  max
0  A      1    1
1  B      2    3
2  C      4    4
3  D      5    6
4  E      3    3
5  A      1    2

You need something similar to the following answer. You just need to change aggregation to min and max I think. https://stackoverflow.com/a/41621009/11560290. — Metin Usta, Mar 31 '23 at 00:36
Using what tools? Do you want to "control" Excel, or are you thinking you'll read an xlsx file and produce a different xlsx file? — Tim Roberts, Mar 31 '23 at 00:38

rxsen · Answer 1 · 2023-04-01T16:36:17.443

Python

Using `itertools.groupby()` method

The ìtertools.groupby() function returns consecutive keys and groups from the iterable.
You can read more about how to use itertools.groupby() on the documentation or on this stackoverflow post.

import pandas as pd
import itertools

df = pd.DataFrame({
    'col_a': ['A','B','B','C','D','D','E','A','A'],
    'col_b': [1,2,3,4,5,6,3,2,1]
})

groups = []
for key, group in itertools.groupby(df.index, lambda x: df['col_a'][x]):
    col_b_values = df.loc[group, 'col_b'].tolist() # Extract the 'col_b' values as a list
    groups.append({
        'col_a': key,
        'min': min(col_b_values),
        'max': max(col_b_values)
    })

pd.DataFrame(groups)

Output:

    col_a   min max
0   A   1   1
1   B   2   3
2   C   4   4
3   D   5   6
4   E   3   3
5   A   1   2

Using Pandas `shift().cumsum()` and `groupby` methods

You can create a new column which identifies consecutive rows in col_a using shift().cumsum()

import pandas as pd

df = pd.DataFrame({
    'col_a': ['A','B','B','C','D','D','E','A','A'],
    'col_b': [1,2,3,4,5,6,3,2,1]
})

df['group'] = (df.col_a != df.col_a.shift()).cumsum()

The output will look like:

  col_a col_b group
0   A   1   1
1   B   2   2
2   B   3   2
3   C   4   3
4   D   5   4
5   D   6   4
6   E   3   5
7   A   2   6
8   A   1   6

And then group by the df['group'] column:

df.groupby(['group','col_a'], as_index=True).agg(min=('col_b','min'), max=('col_b', 'max')).reset_index(level='group',drop=True).reset_index()

Output:

 col_a  min max
0   A   1   1
1   B   2   3
2   C   4   4
3   D   5   6
4   E   3   3
5   A   1   2

Excel

Using `LAMBDA` and `OFFSET` functions

With Microsoft's introduction of dynamic arrays in the recent years, we can implement approximately same logic in Excel without the use of VBA.

First we will create an array which for each row of col_a identifies whether it is a duplicate consecutive. Then we create a range that generates a unique consecutive group ID and then add min/max arrays to get minimum and maximum col_b values for each consecutive group ID. Finally we group it by group ID and drop duplicates. The function is:

=LET(
col_a,$A$2:$A$10,
col_b,$B$2:$B$10,
is_dup,MAP(col_a,OFFSET(col_a,-1,0),LAMBDA(a_1,a_2,a_1=a_2)),
groupids,SCAN(0,is_dup,LAMBDA(a,r,IF(r,0+a,a+1))),
_min_val,BYROW(groupids,LAMBDA(r,TAKE(SORT(FILTER(col_b,groupids=r),,1),1))),
_max_val,BYROW(groupids,LAMBDA(r,TAKE(SORT(FILTER(col_b,groupids=r),,-1),1))),
VSTACK({"col_a","min","max"},CHOOSECOLS(UNIQUE(HSTACK(col_a,groupids,_min_val,_max_val)),1,3,4))
)

Excel implementation

Other methods using Excel

You might be able to achieve similar result using the Scripting.Dictionary object in VBA or combination of INDEX and AGGREGATE functions (see this stackoverflow post)

inquirer · Answer 2 · 2023-04-01T14:05:02.757

df['cs'] = (df['col_a'] != df['col_a'].shift()).cumsum()
aaa = (df.groupby(['cs', 'col_a'])['col_b'].aggregate(['min', 'max'])).reset_index().drop(columns=['cs'])

print(aaa)

Output

  col_a  min  max
0     A    1    1
1     B    2    3
2     C    4    4
3     D    5    6
4     E    3    3
5     A    1    2

Here, in the created column 'cs', the current line and the next row are checked for inequality. Where the condition is true, it is set to True, then the cumulative sum is calculated and this column is used for grouping. The aggregates are calculated, the indexes are reset and the 'cs' column is removed.

query using group by in Python and Excel

2 Answers2

Python

Using itertools.groupby() method

Using Pandas shift().cumsum() and groupby methods

Excel

Using LAMBDA and OFFSET functions

Other methods using Excel

Using `itertools.groupby()` method

Using Pandas `shift().cumsum()` and `groupby` methods

Using `LAMBDA` and `OFFSET` functions