I am trying to repartition my dataframe in order to achieve parallelism. It was suggested that each partition should be smaller than 128MB. To achieve that, I need to know the size of each row in my dataframe. How can I calculate or find the size of each row in my dataframe?

Thank you.

BdEngineer

1 Answer

As discussed in the link that I mentioned in my first comment, you can use java.lang.instrument.

The solution I propose uses Java, Maven, and Spark 2.4.0.

Your project must have the following structure; otherwise, you will have to adapt your pom.xml to match your own structure:

src
--main
----java
------size
--------Sizeof.java
------spark
--------SparkJavaTest.java
----resources
------META-INF
--------MANIFEST.MF

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.formation.SizeOf</groupId>
    <artifactId>SizeOf</artifactId>
    <version>1.0-SNAPSHOT</version>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifestFile>src/main/resources/META-INF/MANIFEST.MF</manifestFile>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>spark.SparkJavaTest</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id> <!-- this is used for inheritance merges -->
                        <phase>package</phase> <!-- bind to the packaging phase -->
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>
</project>

Sizeof.java

package size;

import java.lang.instrument.Instrumentation;

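// Java agent that captures the JVM's Instrumentation instance at startup.
public final class Sizeof {
    private static Instrumentation instrumentation;

    // Called by the JVM before main() when the jar is passed with -javaagent.
    public static void premain(String args, Instrumentation inst) {
        instrumentation = inst;
    }

    // Returns the JVM's approximation of the object's own (shallow) size in bytes.
    public static long sizeof(Object o) {
        if (instrumentation == null) {
            throw new IllegalStateException("Agent not loaded; start the JVM with -javaagent");
        }
        return instrumentation.getObjectSize(o);
    }
}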

SparkJavaTest.java

package spark;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import size.Sizeof;

public class SparkJavaTest {
    public static SparkSession spark = SparkSession
            .builder()
            .appName("JavaSparkTest")
            .master("local")
            .getOrCreate();

    public static void main(String[] args) {
        // Read the CSV file, treating the first line as column names
        Dataset<Row> ds = spark.read().option("header", true).csv("sample.csv");

        ds.show(false);

        // Size of the Dataset object itself (shallow; excludes the rows it refers to)
        System.out.println("size of ds " + Sizeof.sizeof(ds));

        JavaRDD<Row> dsToJavaRDD = ds.toJavaRDD();
        // Size of the JavaRDD object itself (shallow; excludes the rows it refers to)
        System.out.println("size of rdd " + Sizeof.sizeof(dsToJavaRDD));

        spark.stop();
    }
}

MANIFEST.MF

Manifest-Version: 1.0
Premain-Class: size.Sizeof
Main-Class: spark.SparkJavaTest

After that, clean and package the project:

mvn clean package

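Then you can run it and get the size of your objects. Note that Instrumentation.getObjectSize returns an implementation-specific approximation of the storage used by the object itself, so for a Dataset or a JavaRDD it does not include the rows that the object refers to: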

java -javaagent:target/SizeOf-1.0-SNAPSHOT-jar-with-dependencies.jar -jar target/SizeOf-1.0-SNAPSHOT-jar-with-dependencies.jar 
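
Once you have an estimate of the average row size, however you obtain it, the partition count for the 128MB target mentioned in the question is simple arithmetic. Here is a minimal sketch in plain Java. The class name PartitionEstimate, the method partitionsFor, and the sample row count and row size are hypothetical, chosen only for illustration:

```java
public class PartitionEstimate {
    // Target partition size from the question: 128 MB.
    static final long TARGET_PARTITION_BYTES = 128L * 1024 * 1024;

    // Returns the smallest partition count that keeps each partition at or
    // below targetBytes, assuming rows are spread evenly across partitions.
    static int partitionsFor(long rowCount, long avgRowBytes, long targetBytes) {
        if (rowCount < 0 || avgRowBytes < 0 || targetBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and target positive");
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing
        long totalBytes = Math.multiplyExact(rowCount, avgRowBytes);
        // Ceiling division without the overflow risk of (a + b - 1) / b
        long partitions = totalBytes / targetBytes + (totalBytes % targetBytes == 0 ? 0 : 1);
        // Always use at least one partition, and stay within int range
        return (int) Math.max(1L, Math.min(partitions, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long rowCount = 10_000_000L; // hypothetical: e.g. the result of ds.count()
        long avgRowBytes = 200L;     // hypothetical: your measured average row size
        System.out.println(partitionsFor(rowCount, avgRowBytes, TARGET_PARTITION_BYTES));
    }
}
```

With these sample values the total is 2,000,000,000 bytes, which gives 15 partitions. The resulting number could then be passed to ds.repartition(n).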
Driss NEJJAR