I am trying to run PCA to reduce the dimensionality of my features. There are currently 11,272 features, stored in a SparseVector column. Here is what I run:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Raise executor/driver memory and the result-size cap, and extend the ack timeout
conf = SparkConf()
conf.set("spark.executor.memory", "16g")
conf.set("spark.driver.memory", "64g")
conf.set("spark.core.connection.ack.wait.timeout", "3600")
conf.set("spark.driver.maxResultSize", "10g")
Then I try to reduce the features with PCA, using a DataFrame as input:
from pyspark.ml.feature import PCA

# Fit a PCA model that projects total_features down to k dimensions
pca = PCA(k=10, inputCol="total_features", outputCol="pca_features")
model = pca.fit(outputDF2)
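For reference, the total_features column of outputDF2 holds SparseVectors of width 11272. A tiny stand-in with made-up indices and values (outputDF2 itself comes from my feature pipeline and is too large to post) would look like this:

from pyspark.ml.linalg import Vectors

# Two rows with the same vector width as my real data, illustrative values only
sample = spark.createDataFrame(
    [(Vectors.sparse(11272, [0, 3, 17], [1.0, 2.0, 1.0]),),
     (Vectors.sparse(11272, [5, 42], [3.0, 1.0]),)],
    ["total_features"])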
As you can see, I want to reduce the dimensionality to 10 for now; I also tried 100, 500, 30, and 50. Every attempt fails with the following error:
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/usr/hdp/current/spark2-client/python/pyspark/ml/wrapper.py", line 213, in _fit
    java_model = self._fit_java(dataset)
  File "/usr/hdp/current/spark2-client/python/pyspark/ml/wrapper.py", line 210, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o1243.fit.
: java.lang.OutOfMemoryError: Java heap space