I am trying to run PCA to reduce the dimensionality of my features. There are currently 11,272 features, stored in a SparseVector column. Here is what I run:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Raise executor/driver memory and the result-size cap, and extend the ack timeout
conf = SparkConf()
conf.set("spark.executor.memory", "16g")
conf.set("spark.driver.memory", "64g")
conf.set("spark.core.connection.ack.wait.timeout", "3600")
conf.set("spark.driver.maxResultSize", "10g")
Then I try to reduce the features with PCA, using a DataFrame as input:
from pyspark.ml.feature import PCA

# Fit a PCA model that projects total_features down to k dimensions
pca = PCA(k=10, inputCol="total_features", outputCol="pca_features")
model = pca.fit(outputDF2)
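For reference, the total_features column of outputDF2 holds SparseVectors of width 11272. A tiny stand-in with made-up indices and values (outputDF2 itself comes from my feature pipeline and is too large to post) would look like this:

from pyspark.ml.linalg import Vectors

# Two rows with the same vector width as my real data, illustrative values only
sample = spark.createDataFrame(
    [(Vectors.sparse(11272, [0, 3, 17], [1.0, 2.0, 1.0]),),
     (Vectors.sparse(11272, [5, 42], [3.0, 1.0]),)],
    ["total_features"])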
As you can see, I want to reduce the dimensionality to 10 for now; I also tried 100, 500, 30, and 50. Every attempt fails with the following error:
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/usr/hdp/current/spark2-client/python/pyspark/ml/wrapper.py", line 213, in _fit
    java_model = self._fit_java(dataset)
  File "/usr/hdp/current/spark2-client/python/pyspark/ml/wrapper.py", line 210, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o1243.fit.
: java.lang.OutOfMemoryError: Java heap space