
I have a CSV file containing user-item data of dimension 6,365 x 214, and I am computing user-user similarity using columnSimilarities() of org.apache.spark.mllib.linalg.distributed.RowMatrix.

My code looks like this:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix, MatrixEntry, CoordinateMatrix}
import org.apache.spark.rdd.RDD

def rddToCoordinateMatrix(input_rdd: RDD[String]) : CoordinateMatrix = {

    // Convert RDD[String] to RDD[Tuple3]
    val coo_matrix_input: RDD[Tuple3[Long,Long,Double]] = input_rdd.map(
        line => line.split(',').toList
    ).map{
            e => (e(0).toLong, e(1).toLong, e(2).toDouble)
    }

    // Convert RDD[Tuple3] to RDD[MatrixEntry]
    val coo_matrix_matrixEntry: RDD[MatrixEntry] = coo_matrix_input.map(e => MatrixEntry(e._1, e._2, e._3))

    // Convert RDD[MatrixEntry] to CoordinateMatrix
    val coo_matrix: CoordinateMatrix  = new CoordinateMatrix(coo_matrix_matrixEntry)

    return coo_matrix
}

// Read CSV File to RDD[String]
val input_rdd: RDD[String] = sc.textFile("user_item.csv")

// Convert RDD[String] to CoordinateMatrix
val coo_matrix = rddToCoordinateMatrix(input_rdd)

// Transpose CoordinateMatrix
val coo_matrix_trans = coo_matrix.transpose()

// Convert CoordinateMatrix to RowMatrix
val mat: RowMatrix = coo_matrix_trans.toRowMatrix()

// Compute similar columns perfectly, with brute force
// Return CoordinateMatrix
val simsPerfect: CoordinateMatrix = mat.columnSimilarities()

// CoordinateMatrix to RDD[MatrixEntry]
val simsPerfect_entries = simsPerfect.entries

simsPerfect_entries.count()

// Write results to file
val results_rdd = simsPerfect_entries.map(line => line.i+","+line.j+","+line.value)

results_rdd.saveAsTextFile("similarity-output")

// Close the REPL terminal
System.exit(0)

When I run this script in spark-shell, I get the following error after the line simsPerfect_entries.count() executes:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Updated:

I have already tried many of the solutions suggested by others, but with no success:

1. Increasing the amount of memory to use per executor process: spark.executor.memory=1g

2. Decreasing the number of cores to use for the driver process: spark.driver.cores=1

Please suggest a way to resolve this issue.

Akshay Pratap Singh
  • "I tried many solutions already given by others ,but i got no success." you should list which ones so we can avoid redundant answers. – the8472 May 15 '15 at 14:13
  • Akshay, it seems like I am facing the same problem. Here is the question: http://stackoverflow.com/q/37958522/1662775. I tried increasing driver memory but no luck. – Baradwaj Aryasomayajula Jun 22 '16 at 16:17

1 Answer


All Spark transformations are lazy until you actually materialize them. When you define RDD-to-RDD data manipulations, Spark just chains the operations together without performing any actual computation. So when you call simsPerfect_entries.count(), the whole chain of operations is executed and you get your number.
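For illustration, a minimal sketch of this laziness (not the code from the question, just an example):

// Nothing is read or computed here: textFile() and map() only record the plan.
val lines = sc.textFile("user_item.csv")
val lengths = lines.map(line => line.length)

// count() is an action: only now does Spark read the file and run the map.
val total = lengths.count()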

The error GC overhead limit exceeded means that JVM garbage-collector activity was so high that execution of your code effectively stopped. GC activity can get that high for these reasons:

  • You produce too many small objects and immediately discard them. It does not look like that is your case.
  • Your data does not fit into your JVM heap, e.g. you try to load a 2GB text file into RAM but have only 1GB of heap. This looks like your case (see the rough estimate after this list).
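A rough back-of-envelope sketch (the numbers below are only estimates based on the 6,365 x 214 matrix from the question; the real per-entry overhead depends on your JVM):

// columnSimilarities() on the transposed matrix compares every pair of the
// 6,365 user columns, so the result can hold up to n*(n-1)/2 entries.
val users = 6365L
val maxPairs = users * (users - 1) / 2      // about 20.25 million possible entries
val bytesPerEntry = 50L                     // MatrixEntry: 2 Longs + 1 Double plus object overhead (estimate)
val approxBytes = maxPairs * bytesPerEntry  // roughly 1 GB for the entries alone

That is already on the order of the default 1g heap, before counting any shuffle buffers, which is consistent with the GC thrashing you see on count().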

To fix this issue, try to increase the amount of JVM heap (see the sketch after this list) on:

  • your worker nodes if you have a distributed Spark setup.
  • your spark-shell app.
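For example, a minimal sketch for a standalone app (the 4g value is just a placeholder; for spark-shell itself the same thing is done with launch flags such as spark-shell --driver-memory 4g --executor-memory 4g, because the shell's JVM is already running by the time your code executes):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: raise the executor heap before the SparkContext is created.
val conf = new SparkConf()
  .setAppName("user-user-similarity")   // hypothetical app name
  .set("spark.executor.memory", "4g")   // heap for each executor JVM
val sc = new SparkContext(conf)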
shutty
  • Thanks for your answer. I solved my issue by appending an extra flag to increase `driver-memory` for spark-shell; by default it is **1g**, so I increased it to **4g**: *$ spark-shell --driver-memory 4g* – Akshay Pratap Singh May 15 '15 at 22:51
  • How do you solve your reason 1, **You produce too many small objects and immediately discard them**? What is the solution in that case? – Programmer Aug 26 '17 at 20:43
  • If you suspect that too many small objects are being allocated and discarded, it is better to measure that first (because it is rarely the actual cause). You can create a separate question describing your case with examples and all the clues. – shutty Aug 28 '17 at 07:59