
I am trying out PCA (principal component analysis) in Spark ML.

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([1.0, 2.0]),),
        (Vectors.dense([4.0, 4.0]),),
        (Vectors.dense([5.0, 4.0]),)]

df = spark.createDataFrame(data, ["features"])
pca = PCA(k=1, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
transformed_feature = model.transform(df)
transformed_feature.show()

Output:

+---------+--------------------+
| features|         pcaFeatures|
+---------+--------------------+
|[1.0,1.0]|[-1.3949716649258...|
|[1.0,2.0]|[-1.976209858644928]|
|[4.0,4.0]|[-5.579886659703326]|
|[5.0,4.0]|[-6.393620130910061]|
+---------+--------------------+

When I tried PCA on the same data in scikit-learn, as below, it gave a different result:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
pca = PCA(n_components=1)
pca.fit(X)
X_transformed = pca.transform(X)
for x, y in zip(X, X_transformed):
    print(x, y)

Output:

[ 1.  1.] [-2.44120041]
[ 1.  2.] [-1.85996222]
[ 4.  4.] [ 1.74371458]
[ 5.  4.] [ 2.55744805]

As you can see, there is a difference in the output.

To verify the result, I calculated the PCA for the same data mathematically and got the same result as from scikit-learn. For the first data point (1.0, 1.0), with mean vector MX = (2.75, 2.75), the transformation is:

Y = (0.814 * (1.0 - 2.75)) + (0.581 * (1.0 - 2.75)) = -2.441

As you can see, it matches the scikit-learn result.

It seems Spark ML doesn't subtract the mean vector MX from the data vector X, i.e. it uses Y = A*X instead of Y = A*(X - MX).

For the point (1.0, 1.0):

Y = (0.814 * 1.0) + (0.581 * 1.0) = 1.395

which matches the Spark ML result (up to sign).
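
The difference can also be reproduced with plain NumPy (a minimal sketch; the sign of the computed axis may come out flipped):

import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
mean = X.mean(axis=0)                # MX = [2.75, 2.75]

# First principal axis from the SVD of the centered data
_, _, Vt = np.linalg.svd(X - mean)
a = Vt[0]                            # approx. [0.814, 0.581], up to sign

print((X - mean) @ a)   # centered projection -> matches scikit-learn
print(X @ a)            # uncentered projection -> matches Spark ML (up to sign)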

Is Spark ML giving a wrong result, or am I missing something?


1 Answer


In Spark, the PCA transformation does not center (subtract the mean of) the input data automatically; you need to take care of that yourself before applying the method. To normalize the mean of the data, StandardScaler can be used in the following way:

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)
scaled_df = scaler.fit(df).transform(df)

The PCA method can then be applied to scaled_df in the same way as before, and the results will match what scikit-learn gives.
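
For example (a minimal sketch reusing the PCA setup from the question, now pointed at the scaled column):

from pyspark.ml.feature import PCA

pca = PCA(k=1, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(scaled_df)
model.transform(scaled_df).show(truncate=False)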


I would recommend making use of a Spark ML Pipeline to simplify the process. Using the standardization and PCA together could look like this:

from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA, StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)
pca = PCA(k=1, inputCol=scaler.getOutputCol(), outputCol="pcaFeatures")
pipeline = Pipeline(stages=[scaler, pca])

model = pipeline.fit(df)
transformed_feature = model.transform(df)
  • Thanks for pointing out feature scaling before applying PCA, but the output still doesn't match. It seems it's just an implementation difference: Spark ML uses Y = A*X and scikit-learn uses Y = A*(X - MX). – Deepak Kumar Dec 26 '17 at 05:17
  • @DeepakKumar: Yes, there is an implementation difference, but it is not wrong. In Spark you are required to do the normalization yourself, as stated in the answer. Do the numbers not match after you have done the mean normalization? – Shaido Dec 26 '17 at 13:39
  • Magnitude-wise the numbers match (2.44, 1.85, -1.74, -2.5), but their signs are just opposite. – Deepak Kumar Dec 28 '17 at 05:36
  • @DeepakKumar: In PCA, the sign of the vector can be flipped without changing the outcome. Sklearn will flip the signs in some cases; you can see a good answer explaining this here: https://stackoverflow.com/questions/44765682/in-sklearn-decomposition-pca-why-are-components-negative. – Shaido Dec 28 '17 at 10:46
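
To see why the flipped sign is harmless, here is a tiny NumPy illustration (using the approximate axis and mean from the question):

import numpy as np

a = np.array([0.814, 0.581])                        # first principal axis (approx.)
x = np.array([1.0, 1.0]) - np.array([2.75, 2.75])   # centered first point

# Flipping the axis flips the score, but the reconstruction is identical
print((x @ a) * a)      # reconstruction with a
print((x @ -a) * -a)    # same reconstruction with the flipped axis -a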