
How do I print an entire dataframe in Java without running out of memory?

Dataset<Row> df = ...

I know that:

df.show() 

will show the dataframe, but with a large enough dataframe it is possible this could run out of memory.

I know I can limit the content using:

df.show(rowCount, false)

But I want to print the entire dataframe; I do not want to limit the contents...

I have tried:

df.foreachPartition(iter -> {
    while (iter.hasNext()) {
        System.out.println(iter.next().mkString(","));
    }
});

But this will print on each of the respective nodes, not on the driver...

Is there any way I can print everything on the driver without running out of memory?

jgp

2 Answers


AFAIK, the idea of printing a dataframe is to see the data.

Printing a large dataframe is not recommended: depending on its size, running out of memory is possible.

I'd offer the ways below: if you want to see the contents, you can save the dataframe to a Hive table and query it, or write it to CSV or JSON, which is readable.

Examples:

1) Save to a Hive table

df.write().mode("overwrite").saveAsTable("database.tableName");

Later, query it back from the Hive table.
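
A minimal sketch of reading it back (assuming spark is your SparkSession with the same metastore; database.tableName is just the placeholder name from above):

Dataset<Row> saved = spark.table("database.tableName");
saved.show(20, false); // inspect a page of rows at a time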

2) CSV or JSON

df.write().csv("/your/location/data.csv");
df.write().json("/your/location/data.json");

The above will generate multiple part files. If you want a single file, use coalesce(1), but this again moves all the data to one node, which is discouraged unless you absolutely need it; a sketch follows.
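
A minimal sketch of the single-file variant (the output path is only an illustration):

// coalesce(1) merges all partitions into one, so Spark writes a single part file;
// the whole dataset must pass through the one node doing the write.
df.coalesce(1)
  .write()
  .mode("overwrite")
  .csv("/your/location/single_file_csv");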

Another option is to print row by row using toLocalIterator (see here), which will also transfer data to the driver... hence it's not a good idea.
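
For completeness, a minimal sketch of that approach; toLocalIterator() fetches one partition at a time, so the driver only needs enough memory for its largest partition rather than the whole dataframe:

import java.util.Iterator;

Iterator<Row> it = df.toLocalIterator();
while (it.hasNext()) {
    // mkString(",") renders a Row as comma-separated values
    System.out.println(it.next().mkString(","));
}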

Ram Ghadiyaram

You will have to bring all the data to the driver, which will suck your memory a bit :(...

A solution could be to split your dataframe and print it piece by piece on the driver. Of course, it depends on the structure of the data itself; it would look like:

long count = df.count();
long inc = Math.max(1, count / 10); // avoid inc == 0 (infinite loop) when count < 10
for (long i = 0; i < count; i += inc) {
  Dataset<Row> filteredDf =
      df.where("id>=" + i + " AND id<" + (i + inc));

  List<Row> rows = filteredDf.collectAsList();
  for (Row r : rows) {
    System.out.printf("%d: %s\n", r.getAs(0), r.getString(1));
  }
}

I split the dataset in 10 chunks, but I know that my ids run from 0 to 99...

The full example could be:

package net.jgp.books.sparkWithJava.ch20.lab900_splitting_dataframe;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

/**
 * Splitting a dataframe to bring it back to the driver for local
 * processing.
 * 
 * @author jgp
 */
public class SplittingDataframeApp {

  /**
   * main() is your entry point to the application.
   * 
   * @param args
   */
  public static void main(String[] args) {
    SplittingDataframeApp app = new SplittingDataframeApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Splitting a dataframe to collect it")
        .master("local")
        .getOrCreate();

    Dataset<Row> df = createRandomDataframe(spark);
    df = df.cache();

    df.show();
    long count = df.count();
    long inc = Math.max(1, count / 10); // avoid inc == 0 when count < 10
    for (long i = 0; i < count; i += inc) {
      Dataset<Row> filteredDf =
          df.where("id>=" + i + " AND id<" + (i + inc));

      List<Row> rows = filteredDf.collectAsList();
      for (Row r : rows) {
        System.out.printf("%d: %s\n", r.getAs(0), r.getString(1));
      }
    }
  }

  private static Dataset<Row> createRandomDataframe(SparkSession spark) {
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField(
            "id",
            DataTypes.IntegerType,
            false),
        DataTypes.createStructField(
            "value",
            DataTypes.StringType,
            false) });

    List<Row> rows = new ArrayList<Row>();
    for (int i = 0; i < 100; i++) {
      rows.add(RowFactory.create(i, "Row #" + i));
    }
    Dataset<Row> df = spark.createDataFrame(rows, schema);
    return df;
  }
}

Do you think that can help?

It is not as elegant as saving it in a database, but it avoids an additional component in your architecture. This code is not very generic; I am not sure you could make it generic in the current version of Spark.
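
As one possible direction (my own sketch, not part of the answer above): if a dataframe has no dense numeric id to chunk on, you could synthesize one with a window function first. Note that row_number() over a global, unpartitioned window funnels every row through a single partition, so this trades one bottleneck for another:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.Window;

// Adds a 0-based dense "id" column so the chunked where() trick above
// works on any dataframe, at the cost of a single-partition shuffle.
Dataset<Row> indexed = df.withColumn(
    "id",
    row_number().over(Window.orderBy(monotonically_increasing_id())).minus(1));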

jgp