
I know how to find the file size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  // Note: java.io.File only works for local paths; it cannot read HDFS URIs
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt")
  println(file.length) // size in bytes (0 if the file does not exist locally)
}

Spark:

val distFile = sc.textFile(file)
println(distFile.length) // does not compile: RDD has no .length method

but when I process it, I don't get the file size. How do I find the RDD size?

mrsrinivas
Venu A Positive

3 Answers


If you are simply looking to count the number of rows in the rdd, do:

val distFile = sc.textFile(file)
println(distFile.count)

If you are interested in the bytes, you can use the SizeEstimator:

import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))

https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
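
Be aware that SizeEstimator.estimate(distFile) measures the RDD object graph on the driver, not the records distributed across the executors, which may explain results like the roughly constant 43MB reported in the comments below. A minimal sketch of one workaround (variable names and path are illustrative), summing per-record estimates on the executors:

import org.apache.spark.util.SizeEstimator

val distFile = sc.textFile("hdfs://localhost:9000/samplefile.txt")
// Estimate the heap size of each String record on the executors and add them up
val estimatedBytes = distFile.map(line => SizeEstimator.estimate(line)).reduce(_ + _)
println(s"Estimated in-memory size of the records: $estimatedBytes bytes")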

Fokko Driesprong
Glennie Helles Sindholt
  • Thanks, it's working when I import org.apache.spark.util.SizeEstimator, but I don't get the exact value; it's always around 43MB – Venu A Positive Jan 28 '16 at 07:25
  • BTW, if everything is working, can you then mark the question as answered :) – Glennie Helles Sindholt Jan 28 '16 at 08:13
  • Please find my problem here: http://sparkdeveloper.blogspot.in/2016/01/spark-solution-please.html – Venu A Positive Jan 28 '16 at 11:36
  • Oh, I see - I missed the "always around 43MB" part. But if you're not interested in the size that the `dataframe` takes up in memory and just want the size of the file on disk, why don't you just use regular file utils? – Glennie Helles Sindholt Jan 28 '16 at 12:16
  • @GlennieHellesSindholt How would regular file utils work for Parquet, since they will not give me the correct size? – Anand Oct 23 '18 at 08:36
  • Regular file utils will tell you the physical size on disk of any given file - it doesn't matter if it is Parquet, gzipped or packed in any other way. Which file utils are you using that do not give you the correct size? – Glennie Helles Sindholt Oct 25 '18 at 08:57
  • If you use `SizeEstimator` for cache consumption estimates, be aware that it gives the object's bytes in deserialized form, which is not the same as the serialized size of the object; that will typically be much smaller. – belgacea Nov 25 '19 at 15:32
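
A minimal sketch of the "regular file utils" route mentioned in the comments above, using the Hadoop FileSystem API so it also works for HDFS paths (path and variable names are illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}

// On-disk size in bytes of a file or directory, as stored, regardless of format
val path = new Path("hdfs://localhost:9000/samplefile.txt")
val fs = path.getFileSystem(sc.hadoopConfiguration)
val bytesOnDisk = fs.getContentSummary(path).getLength
println(s"Size on disk: $bytesOnDisk bytes")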

Yes, finally I got the solution. Include these imports:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

How to find the RDD Size:

def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .reduce(_+_) //add the sizes together
}

Function to find the DataFrame size (this approach just converts the DataFrame to an RDD of strings internally):

// .toDF() on an RDD[String] needs the implicits in scope,
// e.g. import sqlContext.implicits._ (or spark.implicits._ on Spark 2.x)
val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path

val rddOfDataframe = dataFrame.rdd.map(_.toString())

val size = calcRDDSize(rddOfDataframe)
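
Note that this measures the UTF-8 size of each row's string representation, not the DataFrame's actual memory footprint. If the DataFrame is read through the DataFrame reader rather than built from an RDD, Catalyst also keeps its own rough size estimate; a minimal sketch, assuming Spark 2.2+ (where the plan's stats method takes no arguments) and a SparkSession named spark:

// Catalyst's size estimate in bytes (a BigInt); for a file-based DataFrame this is
// typically derived from the size of the underlying files.
val df = spark.read.text("hdfs://localhost:9000/samplefile.txt")
println(df.queryExecution.optimizedPlan.stats.sizeInBytes)

For RDD-backed DataFrames this statistic may fall back to a configured default and is then not meaningful.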
sumitya
Venu A Positive
  • If you convert a dataframe to an RDD you increase its size considerably. DataFrame uses Project Tungsten for a much more efficient memory representation. If you just want to get an impression of the sizes you can cache both the RDD and the dataframe (make sure to materialize the caching by doing a count on it, for example) and then look under the Storage tab of the UI. Note that in either case you are getting the size in memory and not the file size – Assaf Mendelson Sep 20 '16 at 12:23
  • This answer is wrong. Converting to strings to compute size does not make any sense. In addition, `import org.apache.spark.util.SizeEstimator` is not used – mathieu Dec 20 '17 at 17:14
  • This will actually get you the size of a flat text file, if you were to store the dataframe in one. So pretty much what I am looking for. – Amit Kumar Jun 25 '18 at 10:55
  • @Venu A Positive I am using spark-sql 2.4.1; even after importing all the imports shown here I don't get the _.getBytes method. What else do I need to import? Any pom.xml changes? Please suggest – BdEngineer Aug 16 '19 at 07:04
  • I've tried it in my local Spark installation and it doesn't give exactly the same size as my operating system reports, but it is the closest solution I've found so far. Thank you! – mjbsgll Aug 06 '21 at 14:23

Below is one way I use frequently, apart from SizeEstimator.

It answers, from code, whether an RDD is cached, and more precisely how many of its partitions are cached in memory and how many are cached on disk, what the storage level is, and what the current actual caching status and memory consumption are.

SparkContext has the developer API method getRDDStorageInfo(); occasionally you can use this.

Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc.

For example:

scala> sc.getRDDStorageInfo
res3: Array[org.apache.spark.storage.RDDInfo] =
  Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb, firsttable, None), None" (3)
    StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 1;
    TotalPartitions: 1; MemorySize: 256.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)
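
The same information can be read programmatically; a minimal sketch (field names as in the RDDInfo developer API, so they may change between versions):

// Cache something first so there is storage info to report
val rdd = sc.textFile("hdfs://localhost:9000/samplefile.txt").cache()
rdd.count() // materialize the cache

sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id} '${info.name}': " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"memory = ${info.memSize} bytes, disk = ${info.diskSize} bytes")
}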

It seems the Spark UI also uses the same information; the related Spark JIRA describes it:

Description
With SPARK-13992, Spark supports persisting data into off-heap memory, but the usage of off-heap is not exposed currently, it is not so convenient for user to monitor and profile, so here propose to expose off-heap memory as well as on-heap memory usage in various places:

  1. Spark UI's executor page will display both on-heap and off-heap memory usage.
  2. REST request returns both on-heap and off-heap memory.
  3. Also these two memory usage can be obtained programmatically from SparkListener.
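
A hedged sketch of point 3 above, assuming Spark 2.3+ where the block-manager registration event carries on-heap and off-heap maxima (the listener only sees block managers registered after it is added, so in practice it would be registered early or via spark.extraListeners):

import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockManagerAdded}

sc.addSparkListener(new SparkListener {
  override def onBlockManagerAdded(event: SparkListenerBlockManagerAdded): Unit = {
    // maxOnHeapMem / maxOffHeapMem are Option[Long] values in bytes
    println(s"BlockManager ${event.blockManagerId}: " +
      s"max on-heap = ${event.maxOnHeapMem.getOrElse(-1L)} bytes, " +
      s"max off-heap = ${event.maxOffHeapMem.getOrElse(-1L)} bytes")
  }
})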
Ram Ghadiyaram