
I've got a rather large dataset that I would like to decompose, but it is too big to load into memory. Researching my options, it seems that sklearn's IncrementalPCA is a good choice, but I can't quite figure out how to make it work.

I can load in the data just fine:

import h5py

f = h5py.File('my_big_data.h5', 'r')
features = f['data']

And from this example, it seems I need to decide what size chunks I want to read from it:

num_rows = features.shape[0]  # total number of rows in the data
chunk_size = 10              # how many rows at a time to feed ipca

Then I can create my IncrementalPCA, stream the data chunk-by-chunk, and partially fit it (also from the example above):

from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2)
for i in range(0, num_rows//chunk_size):
    ipca.partial_fit(features[i*chunk_size : (i+1)*chunk_size])

This all goes without error, but I'm not sure what to do next. How do I actually do the dimension reduction and get a new numpy array I can manipulate further and save?

EDIT
The code above was for testing on a smaller subset of my data – as @ImanolLuengo correctly points out, it would be way better to use a larger number of dimensions and chunk size in the final code.

JeffThompson
  • You are missing some rows with range(0, num_rows//chunk_size). Say num_rows=119 and chunk_size=10: the final i is 10, so the last chunk ends at row 110 and the last 9 rows are never fed to ipca. – HD2000 Nov 09 '22 at 14:29

1 Answer


As you well guessed, the fitting is done properly, although I would suggest increasing the chunk_size to 100 or 1000 (or even higher, depending on the shape of your data).
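For what it's worth, here is a rough sketch of sizing the chunks from a memory budget instead of hard-coding a number; budget_bytes is a hypothetical knob, and n_components stands for whatever you pass to IncrementalPCA:

budget_bytes = 100 * 1024 ** 2                        # ~100 MB of raw rows per chunk (adjust to taste)
row_bytes = features.shape[1] * features.dtype.itemsize
# each chunk fed to partial_fit must contain at least n_components rows, hence the max()
chunk_size = int(max(n_components, budget_bytes // row_bytes))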

What you have to do now is actually transform it:

out = my_new_features_dataset  # preallocated array or HDF5 dataset of shape (N, 2)
for i in range(0, num_rows // chunk_size):
    out[i*chunk_size:(i+1)*chunk_size] = ipca.transform(features[i*chunk_size:(i+1)*chunk_size])

And that should give you your new transformed features. If you still have too many samples to fit in memory, I would suggest using out as another hdf5 dataset.
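For example, a minimal sketch of streaming the transformed chunks straight into a second HDF5 file; 'my_reduced_data.h5' and 'pca_features' are just placeholder names:

import h5py

with h5py.File('my_reduced_data.h5', 'w') as f_out:
    # on-disk output; nothing larger than one chunk is held in memory at a time
    out = f_out.create_dataset('pca_features',
                               shape=(num_rows, ipca.n_components_),
                               dtype='float32')
    for i in range(0, num_rows // chunk_size):
        chunk = features[i*chunk_size:(i+1)*chunk_size]
        out[i*chunk_size:(i+1)*chunk_size] = ipca.transform(chunk)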

Also, I would argue that reducing a huge dataset to 2 components is probably not a very good idea, but it is hard to say without knowing the shape of your features. I would suggest reducing them to sqrt(features.shape[1]), as it is a decent heuristic, or pro tip: use ipca.explained_variance_ratio_ to determine the best number of components for the amount of information loss you can afford.
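For instance, a tiny sketch of that heuristic (treat it purely as a starting point, not a rule):

import numpy as np
from sklearn.decomposition import IncrementalPCA

# sqrt(n_features) heuristic; still check explained_variance_ratio_ afterwards
n_components = int(np.sqrt(features.shape[1]))
ipca = IncrementalPCA(n_components=n_components)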


Edit: as for the explained_variance_ratio_, it returns a vector of dimension n_components (the n_components that you pass as a parameter to IPCA) where each value i indicates the percentage of the variance of your original data explained by the i-th new component.

You can follow the procedure in this answer to extract how much information is preserved by the first n components:

>>> print(ipca.explained_variance_ratio_.cumsum())
[ 0.32047581  0.59549787  0.80178824  0.932976    1.        ]

Note: the numbers are fictitious, taken from the answer above and assuming that you have reduced IPCA to 5 components. The i-th number indicates how much of the original data is explained by the first [0, i] components, as it is the cumulative sum of the explained variance ratio.

Thus, what is usually done is to fit your PCA to the same number of components as your original data:

ipca = IncrementalPCA(n_components=features.shape[1])

Then, after training on your whole data (with iteration + partial_fit) you can plot explained_variance_ratio_.cumsum() and choose how much data you want to lose. Or do it automatically:

k = np.argmax(ipca.explained_variance_ratio_.cumsum() > 0.9) + 1

np.argmax returns the first index of the cumsum array where the value is > 0.9; adding 1 turns that 0-based index into the number of PCA components needed to preserve at least 90% of the original data.
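As a reference, a minimal sketch of the plotting step described above; it assumes ipca has already been partial_fit over every chunk with n_components=features.shape[1] (which also means each chunk must contain at least that many rows), and the 0.9 threshold is just an example:

import numpy as np
import matplotlib.pyplot as plt

cumvar = ipca.explained_variance_ratio_.cumsum()

# visual check: where does the curve cross the variance you want to keep?
plt.plot(np.arange(1, len(cumvar) + 1), cumvar)
plt.axhline(0.9, linestyle='--')   # hypothetical 90% threshold
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()

# automatic choice, equivalent to the argmax line above
k = np.argmax(cumvar > 0.9) + 1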

Then you can tweak the transformation to reflect it:

cs = chunk_size
out = my_new_features_dataset  # preallocated array or HDF5 dataset of shape (N, k)
for i in range(0, num_rows // cs):
    out[i*cs:(i+1)*cs] = ipca.transform(features[i*cs:(i+1)*cs])[:, :k]

NOTE the slicing to :k to select only the first k components while ignoring the rest.

Imanol Luengo
  • So I don't use `partial_fit` at all? My understanding (which could be wrong) is that you do `partial_fit` first through the data, then transform the whole thing. – JeffThompson Jun 02 '17 at 18:20
  • @JeffThompson Yes you have to, that is AFTER you have done the partial fit. You have to first fit with all your data, and after that is done, transform all your data. – Imanol Luengo Jun 02 '17 at 18:21
  • I see – so you have to loop through the chunks again, after the `partial_fit` loop? – JeffThompson Jun 02 '17 at 18:25
  • @JeffThompson Yes, you essentially have to loop over the data twice. If it does fit in your memory, you could do `new_data = ipca.fit_transform(features)` in a single step. But if it doesn't, you have to first `ipca.partial_fit` (iterating over samples), and then `ipca.transform`, again iterating over samples if they don't fit in memory. – Imanol Luengo Jun 02 '17 at 18:59
  • Super helpful, thanks! Agreed re not reducing to 2 components in this step, or using a chunk size of 10. This was just from some test code – the plan is to reduce to about 50 dims, then down to 2 with t-SNE. – JeffThompson Jun 02 '17 at 19:11
  • sorry for all the comments! Thanks for adding the suggestions re number of features, very helpful. Can you explain how to use the `explained_variance_ratio_` for this? – JeffThompson Jun 02 '17 at 20:48
  • There is no need to `partial_fit` after finding the `n_components` from the `explained_variance_ratio` right? Just partial `transform` and done? – MehmedB Jun 14 '19 at 12:43
  • And I know OP didn't mention it but I wonder how to apply `inverse_transform` to the partially transformed data? – MehmedB Jun 14 '19 at 12:54
  • How do you initialize the `out` variable? What is the `my_new_features_dataset`? – otayeby Sep 30 '20 at 19:25