
I'm using the following code to generate association rules with the FP-Growth algorithm in Spark MLlib.

model.generateAssociationRules(minConfidence).collect().foreach { rule =>
  println(
    rule.antecedent.mkString("[", ",", "]")
      + " => " + rule.consequent.mkString("[", ",", "]")
      + ", " + rule.confidence)
}

But whenever I try to run the algorithm on a large table with 100 million records, it fails with a Java heap space error.

What is the alternative to using the collect() method when running the FP-Growth algorithm on large datasets?

I'm using Spark 1.6.2 with Scala 2.10.
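
For reference, if the rules only need to end up in storage rather than in driver memory, the collect() can be avoided entirely by formatting each rule on the executors and writing the result out directly. This is just a sketch; the output path is a placeholder:

// Format and write the rules from the executors; nothing is pulled to the driver.
model.generateAssociationRules(minConfidence)
  .map(rule =>
    rule.antecedent.mkString("[", ",", "]")
      + " => " + rule.consequent.mkString("[", ",", "]")
      + ", " + rule.confidence)
  .saveAsTextFile("hdfs:///tmp/association-rules") // placeholder path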

Solution code

// Collect the frequent itemsets one partition at a time so only a single
// partition has to fit in driver memory (MasterMap1 and size are defined elsewhere).
val parts1 = model.freqItemsets.partitions
parts1.foreach { p =>
  val idx1 = p.index
  // Keep only the rows of partition idx1; return empty iterators for the rest.
  val partRdd1 = model.freqItemsets.mapPartitionsWithIndex {
    case (index: Int, value: Iterator[FPGrowth.FreqItemset[String]]) =>
      if (index == idx1) value else Iterator()
  }
  partRdd1.collect().foreach { itemset =>
    MasterMap1(itemset.items.mkString(",").replace(" ", "")) = (itemset.freq / size).toString
  }
}
  • I was able to execute my code using the suggestion provided in another post: http://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine – Babloo Manohar Rajkumar Jan 25 '17 at 13:27
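
The same idea can also be expressed with RDD.toLocalIterator, which streams one partition at a time to the driver instead of looping over partition indexes manually. A sketch under the same assumptions as the solution code above (MasterMap1 and size defined elsewhere):

// Only one partition's itemsets are held on the driver at any time.
model.freqItemsets.toLocalIterator.foreach { itemset =>
  MasterMap1(itemset.items.mkString(",").replace(" ", "")) = (itemset.freq / size).toString
}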

1 Answer


Try increasing driver memory if applicable. If you are running your app on YARN, it is better to size the driver memory according to the container memory: driver heap memory + memory overhead (~15% of heap memory) should equal the YARN container memory.
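
For example, on YARN the sizing could look like this (a sketch; the numbers and jar name are placeholders, and spark.yarn.driver.memoryOverhead is specified in MB in Spark 1.6):

# 6g driver heap + ~1g overhead must fit inside the YARN container size
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 6g \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  your-app.jar   # placeholder for the actual application jar and arguments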

FaigB