
I have built a PySpark DataFrame using:

data = sqlContext.read.load('data.csv', format='com.databricks.spark.csv', delimiter=',', header='true', inferSchema='true')

I want to perform PCA on this DataFrame. Its schema is:

>>> data
DataFrame[col0: double, col1: double, col2: double, col3: double, col4: double]

>>> data.show()
+---------------+---------------+---------------+---------------+---------------+
|           col0|           col1|           col2|           col3|           col4|
+---------------+---------------+---------------+---------------+---------------+
|   -8.801490628| -1.68848604044|  6.29108688718|  1.68614762629| -2.78418041902|
|  6.99040350558| -2.79455708195| -5.57115314522|  4.22337477957|-0.366589003047|
|   6.8950808389|  7.65514024658|   8.0214838208| -5.12100927058|  3.17467779733|
|  6.74150161414|  1.19627062139| 0.821181991602|  5.12589137044| -3.86248588187|
|  9.15545404244|  7.80553468656|  -8.1232517076|   2.6242726214| -7.59049824307|
|   -6.014643738|-0.470165781449|-0.226389435704| -2.55837378209| -2.06405566854|
| -9.49629160445| -9.85331556717| -7.44474566663|  6.48359295657|  9.75680835864|
| 0.450876020546| -3.55454445478| -2.82100689682|  5.15104966779| -7.70810268078|
| -7.21960567005| 0.102168086158| -1.46779736909| -3.87897074493| -3.17592118456|
| -8.75820987524| -8.63519048007| -4.20447284625|-0.394878764685| -5.79070138764|
|  9.47825273869|  6.02827892008|  -9.7181540689|  -9.0341215112|  5.96203870171|
| -1.56616611175|  1.64353225245|  9.20883287312|-0.158689954569|  4.92646032432|
|-0.952144934546|  -2.9114138684|  2.99204980215| -4.64479019591| -5.99952901402|
|  3.55670956201|-0.812146671595| -1.81243042667|  -1.0765836636|   4.9669633757|
| -2.28427448245| 0.982018554172|   2.2453332695|  1.02432988704| -7.42272905399|
|   5.5901346625|   9.7266134961| 0.372411854139|  4.62762920616| -7.39599025974|
|  9.54828822231| -2.99982461624|  2.17542923571|  6.98459564802|  4.17077742377|
| -6.93309333389|  6.54244346903| 0.783827506295|  4.51631424946|  5.14605443379|
| -1.39844067044|  5.94842772889| 0.270728638304|  4.71245951003|  7.60767471606|
| -7.45885401935| -2.17059549479|  9.13976371571| -7.59189334493|  -2.3924001937|
+---------------+---------------+---------------+---------------+---------------+

To do that I have to work with pyspark.ml.feature, so this is how I am doing it:

from pyspark.ml.feature import PCA

dataPCA = PCA(k=2, inputCol=str(data.columns), outputCol="pcaFeatures")
model = dataPCA.fit(data)

and I am getting this error:

pyspark.sql.utils.IllegalArgumentException: u'Field "[\'col0\', \'col1\', \'col2\', \'col3\', \'col4\']" does not exist.'

What's wrong, and how do I fix it?

  • `inputCol` refers to a single column, which should hold all the features, [e.g.](https://spark.apache.org/docs/2.2.0/ml-features.html#pca) – mkaran Nov 15 '17 at 11:36

1 Answer


As mentioned by mkaran, PCA requires a Vector column as input. You have to assemble your data first, for example using VectorAssembler or RFormula.
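
A minimal sketch using VectorAssembler (the output column name "features" is an arbitrary choice here, not something the API requires):

from pyspark.ml.feature import PCA, VectorAssembler

# Combine all feature columns into a single vector column.
assembler = VectorAssembler(inputCols=data.columns, outputCol="features")
assembled = assembler.transform(data)

# PCA reads one vector column, not a stringified list of column names.
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
result = pca.fit(assembled).transform(assembled).select("pcaFeatures")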

Please follow the examples in *Encode and assemble multiple features in PySpark* for details.

Or, using RFormula:

from pyspark.ml.feature import RFormula

# RFormula builds "~ col0 + col1 + ..." and adds a "features" vector column.
data = RFormula(formula=" ~ {0}".format(" + ".join(data.columns))).fit(data).transform(data)
dataPCA.setInputCol("features").fit(data).transform(data)
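
Once fitted, the PCAModel also exposes the components and the explained variance (available in Spark 2.0+), which makes for a quick sanity check:

model = dataPCA.setInputCol("features").fit(data)
print(model.pc)                 # matrix of principal components
print(model.explainedVariance)  # variance explained by each component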