
I am new to Apache Spark (version 1.4.1). I wrote some small code to read a text file and store its data in an RDD.

Is there a way I can get the size of the data in the RDD?

This is my code:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator
import org.apache.spark.sql.Row

object RddSize {

  def main(args: Array[String]) {

    val sc = new SparkContext("local", "data size")
    val FILE_LOCATION = "src/main/resources/employees.csv"
    val peopleRdd = sc.textFile(FILE_LOCATION)

    val newRdd = peopleRdd.filter(str => str.contains(",M,"))
    // Here I want to find the size of the remaining data
  }
} 

I want to get the size of the data before the filter transformation (peopleRdd) and after it (newRdd).

bob
  • What do you mean by "size"? Number of rows in the RDD? If so the "count" RDD function does that - `def count(): Long Return the number of elements in the RDD.`, from the Spark doc. – The Archetypal Paul Aug 24 '15 at 10:25
  • @Paul No, here size doesn't mean the number of rows. Suppose my file is 100MB in size; I get the file data into an RDD and apply a filter. The data must have reduced in size. I want to get that size (in MB). – bob Aug 24 '15 at 10:38
  • Not sure you want to do that. RDDs are lazy, so nothing has been executed yet for `newRDD`. If you want the size, you'll force it to be evaluated and probably do too much work. – The Archetypal Paul Aug 24 '15 at 10:40
  • Thanks for replying @Paul, I am a newbie in Spark. I have no idea how I can force it to evaluate the size. Can you give some suggestions? – bob Aug 24 '15 at 10:45
  • You don't want to force it to evaluate the size, but to evaluate the answer overall. Why do you care what the size is? – The Archetypal Paul Aug 24 '15 at 11:57
  • @Paul I am working on an application which takes a file and filters some data. In response, I want to show the size of the data which is actually useful after processing a large file (which can be 1GB). – bob Aug 24 '15 at 12:09
  • Sorry, again. Why? Unless that's the final answer and no more processing is needed. In which case, write it to a file and look at the file size. I can't see why knowing the size it consumes in memory is of interest. – The Archetypal Paul Aug 24 '15 at 13:13
  • Actually that's not the final answer. It depends on the user, @Paul: if they want, they can do some more filtering based on the remaining size. For that I need to show the initial size and the remaining size. Once the transformations are done, the data has to be stored in Spark SQL (that part I know; I am stuck at finding the size). – bob Aug 24 '15 at 14:35
  • Please explain further why you need the size for the remaining filtering. Spark works (and gains its performance) by being 'lazy' and only doing the actual computation when the result is needed. So, normally, there is no "remaining" size in the middle of a computation, because the computation hasn't been done yet. So please explain what you are trying to do overall - because very probably, knowing the size in the middle is not necessary, or will negatively impact performance. – The Archetypal Paul Aug 24 '15 at 15:02
  • @Paul Suppose I have a file of 100MB. I read it into an RDD, then I do some transformations on the RDD. After that I create a table in Spark SQL from that RDD. I want to know the size of the data (increased or decreased) in the RDD just before creating the table in Spark SQL. I hope my issue is clear now. Thanks a lot for looking into my problem :). Please let me know whether it is even possible or not. – bob Aug 24 '15 at 16:53
  • No, it's not clear. WHY do you want to know the size? You keep telling me you want to know the size, but why? What are you going to do with the answer? If you want to know what the size of the table will be, in memory, before you create it, then no, I don't think it's possible. I'm not being deliberately obtuse; I have no idea how knowing the RDD is 100MB or 90MB or 110MB will help you in any way. – The Archetypal Paul Aug 24 '15 at 16:55
  • Spark SQL tables compress the data, so that won't be helpful; otherwise I would have gone with the in-memory size. Thanks a lot for the help :) – bob Aug 24 '15 at 17:04

3 Answers


There are multiple ways to get the RDD size.

1. Add a Spark listener to your Spark context:

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// sc is the SparkContext created in the question
sc.addSparkListener(new SparkListener() {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
    stageCompleted.stageInfo.rddInfos.foreach { rddInfo =>
      println("rdd memSize  " + rddInfo.memSize)
      println("rdd diskSize " + rddInfo.diskSize)
    }
  }
})
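Note that the listener only fires once a stage has completed, and the reported memSize/diskSize are generally only non-zero for RDDs that have been persisted. A minimal sketch of forcing that, assuming the RDDs from the question:

newRdd.cache()   // persist the RDD so it shows up with a memory size
newRdd.count()   // run an action so a stage completes and the listener fires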

2. Save your RDD as a text file:

myRDD.saveAsTextFile("person.txt")

and call the Apache Spark REST API:

/applications/[app-id]/stages
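A minimal sketch of calling that endpoint from Scala, assuming the driver UI is on the default port 4040; the app id is a placeholder to be read from the /applications response:

import scala.io.Source

val appId  = "<app-id>"  // placeholder: take the real id from the /applications response
val apps   = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
val stages = Source.fromURL(s"http://localhost:4040/api/v1/applications/$appId/stages").mkString
println(stages)  // JSON that includes per-stage input/output byte counts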

3. You can also try SizeEstimator:

val rddSize = SizeEstimator.estimate(myRDD)
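Be aware that SizeEstimator.estimate(myRDD) typically measures only the driver-side RDD object (its lineage and metadata), not the distributed data it represents. A rough sketch of estimating the in-memory size of the elements themselves is to estimate each partition and sum the results; note that this materializes every partition as a list:

import org.apache.spark.util.SizeEstimator

val estimatedBytes = newRdd
  .mapPartitions(iter => Iterator(SizeEstimator.estimate(iter.toList)))
  .reduce(_ + _)  // sum the per-partition estimates
println(s"estimated in-memory size: $estimatedBytes bytes")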
Gabber

I'm not sure you need to do this. You could cache the RDD and check the size in the Spark UI. But let's say you do want to do this programmatically; here is a solution.

    def calcRDDSize(rdd: RDD[String]): Long = {
        //map to the size of each string, UTF-8 is the default
        rdd.map(_.getBytes("UTF-8").length.toLong) 
           .reduce(_+_) //add the sizes together
    }

You can then call this function for your two RDDs:

println(s"peopleRdd is [${calcRDDSize(peopleRdd)}] bytes in size")
println(s"newRdd is [${calcRDDSize(newRdd)}] bytes in size")

This solution should work even if the file size is larger than the memory available in the cluster.
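One caveat: reduce throws an UnsupportedOperationException on an empty RDD (for example, if the filter matches nothing). A variant of the same function using fold (the name calcRDDSizeSafe is just illustrative) avoids that:

    import org.apache.spark.rdd.RDD

    def calcRDDSizeSafe(rdd: RDD[String]): Long =
      rdd.map(_.getBytes("UTF-8").length.toLong)
         .fold(0L)(_ + _) // 0 for an empty RDD instead of an exception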

Patrick McGloin

The Spark API doc says that:

  1. You can get info about your RDDs from the Spark context: sc.getRDDStorageInfo
  2. The RDD info includes memory and disk size: RDDInfo doc (see the usage sketch below)
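A minimal usage sketch, assuming the RDD has been persisted and an action has run (getRDDStorageInfo only reports RDDs that are actually cached):

newRdd.cache()
newRdd.count()  // force evaluation so cached blocks exist

sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: memSize=${info.memSize} bytes, diskSize=${info.diskSize} bytes")
}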
Little Bobby Tables