
In the documentation for the PCA class in scikit-learn, there is a copy argument that is True by default.

The documentation says this about the argument:
If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results; use fit_transform(X) instead.

I'm not sure what this is saying, though, because how would the function overwrite the input X? When you call .fit(X), the function should just be computing the PCA vectors and updating the internal state of the PCA object, right? So even if you set copy to False, .fit(X) should still return the object self as the documentation says, so shouldn't fit(X).transform(X) still work?

So what is it copying when this argument is set to False?

Additionally, when would I want to set it to False?

Edit: I ran fit and transform separately and combined (as fit_transform) and got slightly different results, even though the copy parameter was the same in both cases.

from sklearn.decomposition import PCA
import numpy as np

X = np.arange(20).reshape((5,4))

print("Separate")
XT = X.copy()
pcaT = PCA(n_components=2, copy=True)
print("Original: ", XT)
results = pcaT.fit(XT).transform(XT)
print("New: ", XT)
print("Results: ", results)

print("\nCombined")
XF = X.copy()
pcaF = PCA(n_components=2, copy=True) 
print("Original: ", XF)
results = pcaF.fit_transform(XF)
print("New: ", XF)
print("Results: ", results)

########## Results
Separate
Original:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
New:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
Results:  [[  1.60000000e+01  -2.66453526e-15]
 [  8.00000000e+00  -1.33226763e-15]
 [  0.00000000e+00   0.00000000e+00]
 [ -8.00000000e+00   1.33226763e-15]
 [ -1.60000000e+01   2.66453526e-15]]

Combined
Original:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
New:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
Results:  [[  1.60000000e+01   1.44100598e-15]
 [  8.00000000e+00  -4.80335326e-16]
 [ -0.00000000e+00   0.00000000e+00]
 [ -8.00000000e+00   4.80335326e-16]
 [ -1.60000000e+01   9.60670651e-16]]
  • I don't understand your question (though I'm not familiar with sklearn in particular). A call to `.fit(X)` has access to `X`, so there's nothing stopping that function call from mutating `X`. Are you familiar with how native python works in this regard? – Andras Deak -- Слава Україні Nov 25 '17 at 01:43
  • Or is your point that `.fit` is a method of a `PCA` object but the mutation would affect the `X` array? Have you just tried calling `.fit(X)` with `copy=False` and printing it before/after the fit? Not copying the input data might be beneficial from a memory management standpoint, in case you know you won't need the original data afterwards. – Andras Deak -- Слава Україні Nov 25 '17 at 01:51
  • @AndrasDeak I know what mutations normally are. I was confused since I didn't know why you would want to mutate the input X since that would change the dimensions of that matrix. – rasen58 Nov 25 '17 at 02:07
  • I also updated the question with results of the run, and now I'm more confused since the functions seem to produce different results. – rasen58 Nov 25 '17 at 02:07
  • The differences you see are on the order of machine precision; for all intents and purposes the results are the same. What would be interesting to check is the same with `copy=False` and `copy=True`. The suggestion of the documentation you're testing _should_ give the same result, and it does, so there's no question there. – Andras Deak -- Слава Україні Nov 25 '17 at 02:29
  • Also, I don't think the _shape_ of `X` would change, but the _values_ inside could. As I said before: use `copy=False` and print `X` before and after a call to `.fit(X)` and see if anything happens. It's also possible that mutation of `X` is not guaranteed, only permitted. – Andras Deak -- Слава Україні Nov 25 '17 at 02:30
  • Yeah, I tried True and False and they resulted in the same values, so maybe mutation of X is not guaranteed. But my question is still what could it be changed to? – rasen58 Nov 25 '17 at 03:58
  • My original answer was incorrect; see the revised version. –  Dec 01 '17 at 19:44

1 Answer


In your example the value of copy ends up being ignored, as explained below. But here is what can happen if you set it to False:

from sklearn.decomposition import PCA
import numpy as np

X = np.arange(20).reshape((5,4)).astype(np.float64)
print(X)
pca = PCA(n_components=2, copy=False).fit(X)
print(X)

This prints the original X:

[[  0.   1.   2.   3.]
 [  4.   5.   6.   7.]
 [  8.   9.  10.  11.]
 [ 12.  13.  14.  15.]
 [ 16.  17.  18.  19.]]

and then shows that X was mutated by the fit method:

[[-8. -8. -8. -8.]
 [-4. -4. -4. -4.]
 [ 0.  0.  0.  0.]
 [ 4.  4.  4.  4.]
 [ 8.  8.  8.  8.]]

The culprit is this line in scikit-learn's source: X -= self.mean_, where the augmented assignment mutates the array in place.
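
Here is a minimal NumPy sketch (independent of scikit-learn) showing the difference between out-of-place and in-place subtraction:

import numpy as np

X = np.arange(20).reshape((5, 4)).astype(np.float64)
mean = X.mean(axis=0)  # column means: [8. 9. 10. 11.]

centered = X - mean    # out-of-place: allocates a new array, X is untouched

X -= mean              # in-place: writes into X's own buffer
print(X[0])            # [-8. -8. -8. -8.] -- the caller's array has changed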

If you set copy=True, which is the default value, then X is not mutated.

A copy is sometimes made even if copy=False

Why did copy make no difference in your example? The only thing the method PCA.fit does with the value of copy is pass it to the utility function check_array, which is called to make sure the data matrix has dtype float32 or float64. If the data has neither of those dtypes, a type conversion happens, and the conversion creates a copy anyway (in your example, from int to float). This is why I made X a float array in my example above.
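
You can observe the conversion-induced copy by calling check_array directly. This is just a sketch; the exact arguments that PCA passes to check_array may differ between scikit-learn versions:

import numpy as np
from sklearn.utils import check_array

X_int = np.arange(20).reshape((5, 4))  # dtype int64
X_float = X_int.astype(np.float64)

# int -> float conversion allocates a new array,
# so the original survives even with copy=False
checked = check_array(X_int, dtype=[np.float64, np.float32], copy=False)
print(checked is X_int)    # False: a converted copy was made

# already float64: no conversion is needed, so the same buffer comes back
# (in typical versions) and fit is then free to mutate it in place
checked = check_array(X_float, dtype=[np.float64, np.float32], copy=False)
print(checked is X_float)  # True: no copy was made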

Tiny differences between fit().transform() and fit_transform()

You also asked why fit(X).transform(X) and fit_transform(X) return slightly different results. This has nothing to do with the copy parameter. The differences are within the rounding error of double-precision arithmetic, and they come from the following:

  • fit performs the SVD as X = U @ S @ V.T (where @ denotes matrix multiplication) and stores V in the components_ attribute.
  • transform multiplies the data by V.
  • fit_transform performs the same SVD as fit does and then returns U @ S.

Mathematically, U @ S is the same as X @ V (with X centered) because V is an orthogonal matrix, but floating-point rounding produces tiny differences between the two.
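
Here is a short sketch demonstrating this with NumPy's own SVD (the signs may differ from scikit-learn's output, which flips them for determinism, but the comparison is self-consistent):

import numpy as np

X = np.arange(20).reshape((5, 4)).astype(np.float64)
Xc = X - X.mean(axis=0)  # PCA centers the data first

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

us = U * s      # U @ diag(s): what fit_transform effectively returns
xv = Xc @ Vt.T  # what fit(X).transform(X) effectively computes

print(np.abs(us - xv).max())  # tiny, on the order of machine precision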

It makes sense that fit_transform computes U @ S instead of X @ V; it's a simpler and more accurate multiplication to perform because S is diagonal. The reason fit(X).transform(X) doesn't do the same is that transform only has access to V (stored as components_), and in any case it has no way of knowing that the argument it receives is the same matrix the model was fitted on.
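
Since S is diagonal, U @ S amounts to scaling each column of U by one singular value, a single multiplication per element instead of a full dot product:

import numpy as np

U = np.random.rand(5, 2)
s = np.array([3.0, 0.5])  # the diagonal entries of S

full = U @ np.diag(s)  # explicit product with the diagonal matrix
scaled = U * s         # equivalent column-wise scaling via broadcasting

print(np.allclose(full, scaled))  # True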