
I know how to find the file size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  // Note: java.io.File only works for local paths; it cannot read HDFS URIs
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt")
  println(file.length) // size in bytes (0 if the file does not exist locally)
}

Spark:

val distFile = sc.textFile(file)
println(distFile.length) // does not compile: RDD has no .length method

but when I process it, I don't get the file size. How do I find the RDD size?

mrsrinivas
Venu A Positive

3 Answers


If you are simply looking to count the number of rows in the rdd, do:

val distFile = sc.textFile(file)
println(distFile.count)

If you are interested in the bytes, you can use the SizeEstimator:

import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))

https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
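
Be aware that SizeEstimator.estimate(distFile) measures the RDD object graph on the driver, not the records distributed across the executors, which may explain results like the roughly constant 43MB reported in the comments below. A minimal sketch of one workaround (variable names and path are illustrative), summing per-record estimates on the executors:

import org.apache.spark.util.SizeEstimator

val distFile = sc.textFile("hdfs://localhost:9000/samplefile.txt")
// Estimate the heap size of each String record on the executors and add them up
val estimatedBytes = distFile.map(line => SizeEstimator.estimate(line)).reduce(_ + _)
println(s"Estimated in-memory size of the records: $estimatedBytes bytes")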

Fokko Driesprong
Glennie Helles Sindholt
  • Thanks, it's working when I import org.apache.spark.util.SizeEstimator, but I don't get the exact value; it's always around 43MB – Venu A Positive Jan 28 '16 at 07:25
  • BTW, if everything is working, can you then mark the question as answered :) – Glennie Helles Sindholt Jan 28 '16 at 08:13
  • Please find my problem here: http://sparkdeveloper.blogspot.in/2016/01/spark-solution-please.html – Venu A Positive Jan 28 '16 at 11:36
  • Oh, I see - I missed the "always around 43MB" part. But if you're not interested in the size that the `dataframe` takes up in memory and just want the size of the file on disk, why don't you just use regular file utils? – Glennie Helles Sindholt Jan 28 '16 at 12:16
  • @GlennieHellesSindholt How would regular file utils work for Parquet, since they will not give me the correct size? – Anand Oct 23 '18 at 08:36
  • Regular file utils will tell you the physical size on disk of any given file - it doesn't matter if it is Parquet, gzipped or packed in any other way. Which file utils are you using that do not give you the correct size? – Glennie Helles Sindholt Oct 25 '18 at 08:57
  • If you use `SizeEstimator` for cache consumption estimates, be aware that it gives the object's bytes in deserialized form, which is not the same as the serialized size of the object; that will typically be much smaller. – belgacea Nov 25 '19 at 15:32
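
A minimal sketch of the "regular file utils" route mentioned in the comments above, using the Hadoop FileSystem API so it also works for HDFS paths (path and variable names are illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}

// On-disk size in bytes of a file or directory, as stored, regardless of format
val path = new Path("hdfs://localhost:9000/samplefile.txt")
val fs = path.getFileSystem(sc.hadoopConfiguration)
val bytesOnDisk = fs.getContentSummary(path).getLength
println(s"Size on disk: $bytesOnDisk bytes")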

Yes, finally I got the solution. Include these imports:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

How to find the RDD Size:

def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .reduce(_+_) //add the sizes together
}

Function to find the DataFrame size (this approach just converts the DataFrame to an RDD of strings internally):

// .toDF() on an RDD[String] needs the implicits in scope,
// e.g. import sqlContext.implicits._ (or spark.implicits._ on Spark 2.x)
val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path

val rddOfDataframe = dataFrame.rdd.map(_.toString())

val size = calcRDDSize(rddOfDataframe)
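
Note that this measures the UTF-8 size of each row's string representation, not the DataFrame's actual memory footprint. If the DataFrame is read through the DataFrame reader rather than built from an RDD, Catalyst also keeps its own rough size estimate; a minimal sketch, assuming Spark 2.2+ (where the plan's stats method takes no arguments) and a SparkSession named spark:

// Catalyst's size estimate in bytes (a BigInt); for a file-based DataFrame this is
// typically derived from the size of the underlying files.
val df = spark.read.text("hdfs://localhost:9000/samplefile.txt")
println(df.queryExecution.optimizedPlan.stats.sizeInBytes)

For RDD-backed DataFrames this statistic may fall back to a configured default and is then not meaningful.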
sumitya
Venu A Positive
  • If you convert a dataframe to an RDD you increase its size considerably. DataFrame uses Project Tungsten for a much more efficient memory representation. If you just want to get an impression of the sizes you can cache both the RDD and the dataframe (make sure to materialize the caching by doing a count on it, for example) and then look under the Storage tab of the UI. Note that in either case you are getting the size in memory and not the file size – Assaf Mendelson Sep 20 '16 at 12:23
  • This answer is wrong. Converting to strings to compute size does not make any sense. In addition, `import org.apache.spark.util.SizeEstimator` is not used – mathieu Dec 20 '17 at 17:14
  • This will actually get you the size of a flat text file, if you were to store the dataframe in one. So pretty much what I am looking for. – Amit Kumar Jun 25 '18 at 10:55
  • @Venu A Positive I am using spark-sql 2.4.1; even after importing all the imports shown here I don't get the _.getBytes method. What else do I need to import? Any pom.xml changes? Please suggest – BdEngineer Aug 16 '19 at 07:04
  • I've tried it in my local Spark installation and it doesn't give exactly the same size as my operating system reports, but it is the closest solution I've found so far. Thank you! – mjbsgll Aug 06 '21 at 14:23

Below is one way I use frequently, apart from SizeEstimator.

It answers, from code, whether an RDD is cached, and more precisely how many of its partitions are cached in memory and how many are cached on disk, what the storage level is, and what the current actual caching status and memory consumption are.

SparkContext has the developer API method getRDDStorageInfo(); occasionally you can use this.

Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc.

For example:

scala> sc.getRDDStorageInfo
res3: Array[org.apache.spark.storage.RDDInfo] =
  Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb, firsttable, None), None" (3)
    StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 1;
    TotalPartitions: 1; MemorySize: 256.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)
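
The same information can be read programmatically; a minimal sketch (field names as in the RDDInfo developer API, so they may change between versions):

// Cache something first so there is storage info to report
val rdd = sc.textFile("hdfs://localhost:9000/samplefile.txt").cache()
rdd.count() // materialize the cache

sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id} '${info.name}': " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"memory = ${info.memSize} bytes, disk = ${info.diskSize} bytes")
}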

It seems the Spark UI also uses the same information; the related Spark JIRA describes it:

Description
With SPARK-13992, Spark supports persisting data into off-heap memory, but the usage of off-heap is not exposed currently, it is not so convenient for user to monitor and profile, so here propose to expose off-heap memory as well as on-heap memory usage in various places:

  1. Spark UI's executor page will display both on-heap and off-heap memory usage.
  2. REST request returns both on-heap and off-heap memory.
  3. Also these two memory usage can be obtained programmatically from SparkListener.
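
A hedged sketch of point 3 above, assuming Spark 2.3+ where the block-manager registration event carries on-heap and off-heap maxima (the listener only sees block managers registered after it is added, so in practice it would be registered early or via spark.extraListeners):

import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockManagerAdded}

sc.addSparkListener(new SparkListener {
  override def onBlockManagerAdded(event: SparkListenerBlockManagerAdded): Unit = {
    // maxOnHeapMem / maxOffHeapMem are Option[Long] values in bytes
    println(s"BlockManager ${event.blockManagerId}: " +
      s"max on-heap = ${event.maxOnHeapMem.getOrElse(-1L)} bytes, " +
      s"max off-heap = ${event.maxOffHeapMem.getOrElse(-1L)} bytes")
  }
})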
Ram Ghadiyaram