
I am new to Apache Spark SQL in Scala.

How can I find the size of each Row in an Apache Spark SQL DataFrame and discard the rows whose size exceeds a threshold in kilobytes? I am looking for a Scala solution.

Avishek Bhattacharya

1 Answer


This is actually kind of a tricky problem. Spark SQL uses columnar data storage, so thinking in terms of individual row sizes isn't very natural. You can of course call .rdd on the DataFrame; from there you can filter the resulting RDD using the techniques from Calculate size of Object in Java to determine each object's size, and then convert your RDD of Rows back to a DataFrame using your SQLContext.
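A rough sketch of what that could look like (not from the original answer): it uses Spark's SizeEstimator as the size-measuring technique, and the helper name filterRowsBySize and the kilobyte threshold parameter are illustrative only. SizeEstimator reports an approximate in-memory JVM object size, so treat the filter as a heuristic rather than an exact byte count.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.util.SizeEstimator

// Hypothetical helper: keep only rows whose estimated in-memory size
// is at or below a threshold given in kilobytes.
def filterRowsBySize(df: DataFrame, sqlContext: SQLContext, thresholdKb: Long): DataFrame = {
  val schema = df.schema
  val keptRows = df.rdd.filter { row =>
    // SizeEstimator.estimate returns an approximate JVM object size in bytes,
    // not the exact columnar or on-disk footprint of the row.
    SizeEstimator.estimate(row) <= thresholdKb * 1024L
  }
  // Rebuild a DataFrame from the surviving RDD[Row] with the original schema.
  sqlContext.createDataFrame(keptRows, schema)
}

// Example usage: drop every row estimated to be larger than 10 KB.
// val trimmed = filterRowsBySize(df, sqlContext, thresholdKb = 10)
```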

Holden