5

Does anyone know how to implement a rolling/moving window PCA on a pandas dataframe? I've looked around and found implementations in R and MATLAB, but not in Python. Any help would be appreciated!

This is not a duplicate: moving-window PCA is not the same as PCA on the entire dataframe. Please see pandas.DataFrame.rolling() if you do not understand the difference.

Michael
  • 3
    That's too broad. Describe what exactly you want and what's wrong with a simple for-loop over your dataframe, each using sklearn's pca? You mention similar tools in other languages, yet there is no link or any formal description. – sascha Aug 29 '17 at 00:00
  • 1
    Why would you want a rolling PCA? It doesn't make sense from a statistical point of view. – Stergios Aug 29 '17 at 06:31
  • 4
    The same reason you want a rolling mean or a rolling standard deviation: the underlying data is a time series. – Michael Aug 29 '17 at 19:01
  • @Michael A little late to the party, but I just left an answer [here](https://stackoverflow.com/questions/73652615/is-there-a-rolling-implementation-of-pca-in-python/73652616#73652616) which you may find valuable. – PyRsquared Sep 08 '22 at 16:54

1 Answer

5

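Unfortunately, pandas.DataFrame.rolling() seems to flatten the df before rolling: the function passed to apply() receives one-dimensional windows of a single column at a time. It therefore cannot be used, as one might expect, to roll over the rows of the df and pass windows of rows to the PCA.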
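
A small check can illustrate the limitation. This is only a sketch: the toy dataframe and the helper name `record_shape` are made up for illustration. It records the shape of every window that `rolling(...).apply` hands to the function:

```python
import numpy as np
import pandas as pd

# Toy frame with two columns (hypothetical example data)
df_demo = pd.DataFrame(np.arange(12.0).reshape(6, 2), columns=["a", "b"])

seen_shapes = []

def record_shape(window_values):
    # With raw=True, each window arrives as a NumPy array
    seen_shapes.append(window_values.shape)
    return 0.0  # apply() expects a numeric return value

df_demo.rolling(3).apply(record_shape, raw=True)

# Every window is one-dimensional: 3 values from a single column,
# never a (3, 2) block of rows.
print(set(seen_shapes))
```

Because each call only sees one column, a multivariate fit such as PCA cannot be run directly inside the applied function.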

The following work-around rolls over row indices instead of rows. It may not be very elegant, but it works:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Generate some data (1000 time points, 10 features)
data = np.random.random(size=(1000, 10))
df = pd.DataFrame(data)

# Set the window size
window = 100

# Initialize an empty df of appropriate size for the output
df_pca = pd.DataFrame(np.zeros((data.shape[0] - window + 1, data.shape[1])))

# Define PCA fit-transform function
# Note: Instead of attempting to return the result,
#       it is written into the previously created output array.
# `window_data` holds the row positions of the current window. With raw=True
# it arrives as a float ndarray, so it is cast to int before indexing.
def rolling_pca(window_data):
    rows = window_data.astype(int)
    pca = PCA()
    transf = pca.fit_transform(df.iloc[rows])
    # Store the scores of the window's first row at the window's start position
    df_pca.iloc[rows[0]] = transf[0, :]
    return 0.0  # `apply` expects a numeric return value

# Create a df containing row indices for the workaround
df_idx = pd.DataFrame(np.arange(df.shape[0]))

# Use `rolling` to apply the PCA function
_ = df_idx.rolling(window).apply(rolling_pca, raw=True)

# The results are now contained here:
print(df_pca)

A quick check reveals that the values produced by this are identical to control values computed by slicing appropriate windows manually and running PCA on them.
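
For comparison, the manual slicing used for such a check might look like the sketch below. It is not the code from the original check: the function name `rolling_pca_scores` and its `n_components` parameter are hypothetical, and it assumes the same convention as the work-around, namely that output row i holds the scores of the first row of the window starting at position i.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def rolling_pca_scores(df, window, n_components=None):
    """Fit an independent PCA on each block of `window` consecutive rows.

    Row i of the result holds the PCA scores of the first row of the
    window that starts at position i.
    """
    values = df.to_numpy()
    if not 1 <= window <= values.shape[0]:
        raise ValueError("window must be between 1 and the number of rows")

    n_windows = values.shape[0] - window + 1
    rows = []
    for start in range(n_windows):
        block = values[start:start + window]  # one window of rows
        scores = PCA(n_components=n_components).fit_transform(block)
        rows.append(scores[0])  # keep only the first row's scores

    return pd.DataFrame(np.vstack(rows), index=df.index[:n_windows])


# Example usage on random data (200 time points, 5 features)
rng = np.random.default_rng(0)
df_example = pd.DataFrame(rng.random((200, 5)))
scores = rolling_pca_scores(df_example, window=50)
print(scores.shape)  # (151, 5)
```

Like the work-around, this refits the PCA from scratch for every window, so it does not save any computation. It only avoids writing into an outer dataframe from inside the applied function.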

WhoIsJack
  • Is this equivalent in runtime to manually slicing and performing independent PCA on each slice? Or is there something that lets you reuse the existing PCA every time you step forward on your window, thus saving time? – Jacob Steinebronn Jun 04 '20 at 21:50
  • It is equivalent to independent PCAs. Would be interesting to try and find a way to keep the existing PCA. Perhaps scikit-learn's `IncrementalPCA` could serve as inspiration. – WhoIsJack Jun 05 '20 at 08:36
  • I've been looking into that, but the IPCA can't remove a record, so it's only half a solution. – Jacob Steinebronn Jun 05 '20 at 13:41
  • 1
    The code gives me a KeyError: 0 at this line: `df_pca.iloc[int(window_data[0])] = transf[0,:]`. Any idea why? – Luigi87 Jul 14 '21 at 13:53
  • Try `window_data.iloc[0]`, as it's a pd.Series. – user3882675 Aug 26 '21 at 19:55
  • What exactly is the window data supposed to be? Would appreciate some clarity on that please. – Draco D Sep 26 '21 at 09:49