
I have a pandas DataFrame with a price column, and I want to create a column called priceLags as shown below:

             price        priceLags
1.           1800

2.           1750          1800

3.           1500          1750
                           1800

4.           1240          1500
                           1750
                           1800

5.           1456          1240
                           1500
                           1750

6.           1302          1456
                           1240
                           1500

priceLags consists of the prices from the three previous rows. In SQL, it is

ARRAY_AGG(price) OVER (ORDER BY ROWS BETWEEN 1 FOLLOWING AND 3 FOLLOWING) AS priceLags

May I ask how I can do this in pandas?

Thank you so much!

atjw94
  • Possible duplicate of [Can pandas groupby aggregate into a list, rather than sum, mean, etc?](https://stackoverflow.com/questions/19530568/can-pandas-groupby-aggregate-into-a-list-rather-than-sum-mean-etc) – Sundeep Pidugu Aug 01 '19 at 03:51

2 Answers


One way you can create the same structure is by:

  1. Creating the lagged variables
df['lagged1'] = df['price'].shift(1)
df['lagged2'] = df['price'].shift(2)
df['lagged3'] = df['price'].shift(3)

df
Out[1]
    price   lagged1 lagged2 lagged3
0   1800    NaN     NaN     NaN
1   1750    1800.0  NaN     NaN
2   1500    1750.0  1800.0  NaN
3   1240    1500.0  1750.0  1800.0
4   1456    1240.0  1500.0  1750.0
5   1302    1456.0  1240.0  1500.0
  2. Stacking those new variables
df.set_index('price').stack(dropna=False)\
   .reset_index(1).drop('level_1', axis=1)\
   .reset_index().rename(columns={0:'priceLags'})

Out[2]:
    price   priceLags
0   1800    NaN
1   1800    NaN
2   1800    NaN
3   1750    1800.0
4   1750    NaN
5   1750    NaN
6   1500    1750.0
7   1500    1800.0
8   1500    NaN
9   1240    1500.0
10  1240    1750.0
11  1240    1800.0
12  1456    1240.0
13  1456    1500.0
14  1456    1750.0
15  1302    1456.0
16  1302    1240.0
17  1302    1500.0

You can also drop the null values in the process:

df.set_index('price').stack(dropna=True).reset_index(level=1, drop=True).reset_index().rename(columns={0:'priceLags'})

Out[3]:
    price   priceLags
0   1750    1800.0
1   1500    1750.0
2   1500    1800.0
3   1240    1500.0
...
10  1302    1240.0
11  1302    1500.0

Added

After looking around a bit, I found this great answer on how to programmatically create lagged columns. We can then stack and reset the index a few times to get to the final result in one call:

lags = range(1, 4)  # lag offsets 1 through 3

(
    df.assign(**{
        f'{col}_{t}': df[col].shift(t)
        for t in lags
        for col in df
    })
    .set_index('price').stack(dropna=True)           # group into one column
    .reset_index(level=1, drop=True)                 # remove the column names
    .reset_index().rename(columns={0: 'priceLags'})  # reinsert the correct col names
)
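If you want a genuine array per row (like SQL's ARRAY_AGG) rather than the stacked long format above, here is a minimal sketch using positional slicing; the DataFrame and column names simply mirror the question:

```python
import pandas as pd

df = pd.DataFrame({'price': [1800, 1750, 1500, 1240, 1456, 1302]})

# For each row, slice out up to three preceding prices and reverse them
# so the most recent lag comes first, matching the question's layout.
df['priceLags'] = [
    df['price'].iloc[max(i - 3, 0):i].tolist()[::-1]
    for i in range(len(df))
]
```

The first row gets an empty list, the second gets `[1800]`, and from the fourth row on each list holds exactly three lags.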
realr
  • 3,652
  • 6
  • 23
  • 34

Another way you can do this is by defining a custom aggregation function. The code below is not the most elegant, but it probably does what you want:

# import some packages
import pandas as pd
from functools import reduce 

# create a test dataframe
df = pd.DataFrame([
    {'a': 'hello', 'b': 1},
    {'a': 'hello', 'b': 5},
    {'a': 'hello', 'b': 6},
    {'a': 'bubye', 'b': 3},
    {'a': 'bubye', 'b': 2},
    {'a': 'bonus', 'b': 3}
])

# define a custom aggregation function that collects a group's values into a list
def create_list(series):
    if len(series) == 1:
        # reduce needs at least two elements, so handle this case directly
        return [x for x in series]
    # fold the series into a list: wrap the first int, then append each later value
    return reduce(lambda x, y: ([x] if type(x) == int else x) + [y], series)

# apply different aggregation functions, including your custom one
(
    df
    .groupby("a")
    .agg({
        "b": ['sum', 'max', create_list],
    })
)
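As the duplicate linked in the comments notes, pandas can also aggregate a group straight into a list, so the reduce-based helper can usually be replaced with the built-in `list`:

```python
import pandas as pd

df = pd.DataFrame([
    {'a': 'hello', 'b': 1},
    {'a': 'hello', 'b': 5},
    {'a': 'hello', 'b': 6},
    {'a': 'bubye', 'b': 3},
    {'a': 'bubye', 'b': 2},
    {'a': 'bonus', 'b': 3},
])

# groupby preserves the original row order within each group,
# so each list comes out in the order the values appeared
out = df.groupby('a')['b'].agg(list)
```

This returns a Series mapping each group key to its list of values, e.g. `out['hello']` is `[1, 5, 6]`.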