I am trying to fetch the size of a Spark row by following this question:

How can I find the size of each Row in an Apache Spark SQL dataframe and discard the rows having size more than a threshold size in kilobytes

Converting to an RDD gives a lot more issues, so I was trying to use toSeq and passing the result on to get the object size.

private[spark] def getEventSize(row: ssql.Row): Long = {
  ObjectSizeFetcher.getObjectSize(row.toSeq)
}

It seems to print the data, but then throws a NullPointerException for the same object:

oWrappedArray(1, 1, 2, 2, 2.0, Map(a -> 1), a, a, 0, 1, Map(1 -> 1), 1, 1, 1.0, 0.0, 0, 1, 1.0)

Exception

java.lang.NullPointerException:
  at com.expediagroup.dataquality.polaris.batchprofiler.utils.ObjectSizeFetcher.getObjectSize(ObjectSizeFetcher.java:16)

I am using Instrumentation.getObjectSize to fetch the size of the Spark row:

import java.lang.instrument.Instrumentation;

public class ObjectSizeFetcher {
    // Populated only when this class is registered as a Java agent via -javaagent;
    // otherwise it remains null.
    private static Instrumentation instrumentation;

    public static void premain(String args, Instrumentation inst) {
        instrumentation = inst;
    }

    public static long getObjectSize(Object o) {
        System.out.println("o" + o);
        if (o == null)
            return 0;
        return instrumentation.getObjectSize(o);
    }
}

Any help is appreciated.

Akshay Hazari

1 Answer


I used SizeEstimator instead and it seems to work for now:

import org.apache.spark.util.SizeEstimator
.
.
.
private[spark] def getEventSize(row: ssql.Row): Long = {
  SizeEstimator.estimate(row)
}
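
For completeness, a minimal sketch of how the estimate could be used to discard rows above a size threshold in kilobytes, which is what the linked question asks for. The DataFrame df, the thresholdKb parameter, and the helper name are my assumptions for illustration, not from the original post:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.util.SizeEstimator

// Hypothetical helper: keep only rows whose estimated size is at or below the threshold.
def dropOversizedRows(df: DataFrame, thresholdKb: Long): DataFrame =
  df.filter((row: Row) => SizeEstimator.estimate(row) <= thresholdKb * 1024)

Note that SizeEstimator.estimate reports the estimated JVM heap footprint of the deserialized Row rather than its serialized size, so the threshold is only approximate.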
Akshay Hazari