18

I have a CSV file with about 5,000 rows and 950 columns. First I load it into a DataFrame:

val data = sqlContext.read
  .format(csvFormat)
  .option("header", "true")
  .option("inferSchema", "true")
  .load(file)
  .cache()

After that I find all the string columns:

val featuresToIndex = data.schema
  .filter(_.dataType == StringType)
  .map(field => field.name)

and want to index them. To do that, I create an indexer for each string column:

val stringIndexers = featuresToIndex.map(colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "Indexed"))

and create a pipeline:

val pipeline = new Pipeline().setStages(stringIndexers.toArray)

But when I try to transform my initial DataFrame with this pipeline

val indexedDf = pipeline.fit(data).transform(data)

I get a StackOverflowError:

16/07/05 16:55:12 INFO DAGScheduler: Job 4 finished: countByValue at StringIndexer.scala:86, took 7.882774 s
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.immutable.Set$Set1.contains(Set.scala:84)
at scala.collection.immutable.Set$Set1.$plus(Set.scala:86)
at scala.collection.immutable.Set$Set1.$plus(Set.scala:81)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:22)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:20)
at scala.collection.generic.Growable$class.loop$1(Growable.scala:53)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
at scala.collection.mutable.SetBuilder.$plus$plus$eq(SetBuilder.scala:20)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
at scala.collection.AbstractTraversable.to(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toSet(TraversableOnce.scala:304)
at scala.collection.AbstractTraversable.toSet(Traversable.scala:104)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild$lzycompute(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:280)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
...

What am I doing wrong? Thanks.

Martin
Andrew Tsibin

3 Answers

6

Most probably there is just not enough memory to keep all the stack frames. I experienced something similar when I trained a RandomForestModel. The workaround that works for me is to run my driver application (it's a web service) with additional parameters:

-XX:ThreadStackSize=81920 -Dspark.executor.extraJavaOptions='-XX:ThreadStackSize=81920'
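For illustration only, a hedged Scala sketch of how the executor-side option could be set programmatically through SparkConf (the application name is just a placeholder); the driver-side -XX:ThreadStackSize still has to be passed when the driver JVM itself is launched, e.g. via --driver-java-options or the service's startup script:

import org.apache.spark.{SparkConf, SparkContext}

// Executor JVMs are launched by Spark, so their stack size can be set through the conf.
val conf = new SparkConf()
  .setAppName("indexing-job") // placeholder name
  .set("spark.executor.extraJavaOptions", "-XX:ThreadStackSize=81920")

// The driver JVM is already running at this point, so its -XX:ThreadStackSize must be
// supplied on the driver's own command line, not through SparkConf.
val sc = new SparkContext(conf)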
evgenii
  • I'm actually facing the same issue; how can I find out the default stack size? Also, I see that this increases the stack size for the executor, not the driver; is that correct? – HHH Mar 15 '17 at 20:09
  • @h.z. It's for both, because both parties work together. ThreadStackSize is for the driver; for the executors it comes from spark.executor.extraJavaOptions. I'm not sure if it's possible to measure the default size; I just increased mine until it started to work. I assume it would still fail for an even bigger dataset. – evgenii Mar 20 '17 at 20:55
  • @evgenii where do we add those parameters in a Jupyter notebook? :) In the Spark initialization or somewhere else? – PolarBear10 Jul 26 '18 at 10:07
2

It seems I have found a solution of sorts: use Spark 2.0. Previously I used 1.6.2, which was the latest version at the time of the issue. I also tried the preview version of 2.0, but the problem was reproduced there as well.
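For reference, a minimal sketch of the same CSV load under the Spark 2.0 API (SparkSession instead of SQLContext); the rest of the indexing code from the question stays unchanged:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-indexing") // placeholder name
  .getOrCreate()

val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(file) // same `file` path as in the question
  .cache()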

Andrew Tsibin
-3

The StackOverflowError in Java: when a method is invoked by a Java application, a stack frame is allocated on the call stack. The stack frame contains the parameters of the invoked method, its local variables, and the return address of the method. The return address denotes the execution point from which program execution shall continue after the invoked method returns. If there is no space for a new stack frame, the StackOverflowError is thrown by the Java Virtual Machine (JVM). The most common case that can exhaust a Java application's stack is recursion, where a method invokes itself during its execution. Recursion is a powerful general-purpose programming technique, but it must be used with caution to avoid a StackOverflowError.
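For illustration, a minimal Scala sketch of unbounded recursion exhausting the thread stack:

// The `1 +` after the recursive call keeps this from being tail-call optimized, so each
// call allocates a new stack frame until the JVM throws java.lang.StackOverflowError.
def recurse(n: Long): Long = 1 + recurse(n + 1)

recurse(0)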

Possible solutions:

1. By default, Spark uses memory-only RDD serialization; try a persist-to-disk option (see the sketch after this list).

2. Try to increase the driver's JVM stack size by adding something like -Xss5m to the driver options. It is likely that some recursion is happening when the column types in data.schema are checked:

--driver-java-options "-Xss100M"
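For illustration, a hedged sketch of point 1 applied to the DataFrame from the question: persist with an explicit storage level that can spill to disk instead of relying on .cache().

import org.apache.spark.storage.StorageLevel

val data = sqlContext.read
  .format(csvFormat)
  .option("header", "true")
  .option("inferSchema", "true")
  .load(file)
  .persist(StorageLevel.MEMORY_AND_DISK) // partitions that do not fit in memory spill to disk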

If possible, share the file and the complete exception trace.

Gangadhar Kadam