4

I am new to pandas. I am using pandas to read a CSV file of timestamped records, into a dataframe. The data has the following columns:

timestamp COLUMN_A COLUMN_B COLUMN_C

After reading the data into the dataframe, I want to be able to run a windowing function on COLUMN_C; The function should return the timestamped values of the column.

I have written something that works for iterables:

import collections
import itertools


def sliding_window_iter(iterable, size):
    """Iterate through iterable using a sliding window of several elements.

    Creates an iterable where each element is a tuple of `size`
    consecutive elements from `iterable`, advancing by 1 element each
    time. For example:

    >>> list(sliding_window_iter([1, 2, 3, 4], 2))
    [(1, 2), (2, 3), (3, 4)]
    """
    iterable = iter(iterable)
    window = collections.deque(
        itertools.islice(iterable, size-1),
        maxlen=size
    )
    for item in iterable:
        window.append(item)
        yield tuple(window)

How do I modify this to work on the column of a dataframe?

Homunculus Reticulli
  • 65,167
  • 81
  • 216
  • 341
  • Pandas has a pretty [comprehensive set of window functions](https://pandas.pydata.org/pandas-docs/stable/reference/window.html) built-in, have you checked to see if what you need already exists? – G. Anderson Mar 04 '21 at 21:09
  • Does this answer your question? [Select next N rows in pandas dataframe using iterrows](https://stackoverflow.com/questions/57198121/select-next-n-rows-in-pandas-dataframe-using-iterrows). You'd still need togenerate the bounds of each successive slice. – aneroid Mar 04 '21 at 21:40
  • Column values only? Or is the index needed for the operation(s).? – wwii Mar 05 '21 at 04:51

1 Answers1

5

It's simpler to successively slice through the dataframe. Since you want overlapping windows [(1, 2), (2, 3), (3, 4), ...], you could write it like this:

def sliding_window_iter(series, size):
    """series is a column of a dataframe"""
    for start_row in range(len(series) - size + 1):
        yield series[start_row:start_row + size]

Usage:

df = pd.DataFrame({'A': list(range(100, 501, 100)),
                   'B': list(range(-20, -15)),
                   'C': [0, 1, 2, None, 4]},
                  index=pd.date_range('2021-01-01', periods=5))

list(sliding_window_iter(df['C'], 2))

Output:

[2021-01-01    0.0
 2021-01-02    1.0
 Freq: D, Name: C, dtype: float64,
 2021-01-02    1.0
 2021-01-03    2.0
 Freq: D, Name: C, dtype: float64,
 2021-01-03    2.0
 2021-01-04    NaN
 Freq: D, Name: C, dtype: float64,
 2021-01-04    NaN
 2021-01-05    4.0
 Freq: D, Name: C, dtype: float64]

Also works if you pass in multiple columns:

list(sliding_window_iter(df.loc[:, ['A', 'C']], 2))

#output:
[              A    C
 2021-01-01  100  0.0
 2021-01-02  200  1.0,
               A    C
 2021-01-02  200  1.0
 2021-01-03  300  2.0,
               A    C
 2021-01-03  300  2.0
 2021-01-04  400  NaN,
               A    C
 2021-01-04  400  NaN
 2021-01-05  500  4.0]
aneroid
  • 12,983
  • 3
  • 36
  • 66