44

I would like to display the entire Apache Spark SQL DataFrame with the Scala API. I can use the show() method:

myDataFrame.show(Int.MaxValue)

Is there a better way to display an entire DataFrame than using Int.MaxValue?

Yuri Brovman
  • Try `myDataFrame.show(false)`. Not sure if that's what you are looking for. – Pramit Sep 12 '16 at 19:52
  • Use RDD.toLocalIterator(), as discussed in this SO post: http://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine – Mark Rajcok Sep 28 '16 at 18:13

7 Answers

75

It is generally not advisable to display an entire DataFrame to stdout, because doing so pulls the entire DataFrame (all of its values) to the driver (unless the DataFrame is already local, which you can check with df.isLocal).

Unless you know ahead of time that your dataset is small enough for the driver JVM process to hold all of its values in memory, it is not safe to do this. That's why the DataFrame API's show() displays only the first 20 rows by default.

You could use df.collect, which returns an Array[Row], and then iterate over each row and print it:

df.collect.foreach(println)

but you lose all formatting implemented in df.showString(numRows: Int) (that show() internally uses).

So no, I guess there is no better way.
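
If driver memory rather than formatting is the main concern, the toLocalIterator approach mentioned in the comments on the question can be sketched as follows. This is only an illustration, assuming Spark 2.x or later (for SparkSession); the object name, the printAll helper, and the sample data are made up for the example. It still loses the show() table formatting.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LocalIteratorPrint {
  // Streams rows to the driver one partition at a time and prints each one.
  // The driver only needs enough memory for the largest partition,
  // not for the whole dataset as with collect.
  def printAll(df: DataFrame): Unit =
    df.rdd.toLocalIterator.foreach(println)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("local-iterator-print")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for myDataFrame.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")
    printAll(df)

    spark.stop()
  }
}
```

The trade-off is speed: partitions are fetched one after another, so this is slower than collect on data that would have fit in memory anyway.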

Grega Kešpret
5

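One way is to use the count() function to get the total number of records and pass that to show(). Because count() returns a Long and show() expects an Int, convert it first: df.show(df.count().toInt).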
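
A minimal sketch of that idea, assuming Spark 2.x or later; the ShowAll object and showAll helper are hypothetical names, not part of Spark. Note that count() runs an extra job before show() does, and that the data still has to fit on the driver.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ShowAll {
  // Shows every row of df in show()'s tabular format.
  // The row count is capped at Int.MaxValue because show() takes an Int.
  def showAll(df: DataFrame, truncate: Boolean = false): Unit = {
    val n = math.min(df.count(), Int.MaxValue.toLong).toInt
    df.show(n, truncate)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("show-all")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // 25 rows, so a plain show() would stop after the first 20.
    val df = (1 to 25).toDF("n")
    showAll(df)

    spark.stop()
  }
}
```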

FelixSFD
AkshayK
4

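Try:

df.show(35, false)

This displays up to 35 rows and prints each column value in full; the second argument (truncate = false) turns off the default truncation of values longer than 20 characters.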
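
To see how the two arguments interact, here is a small sketch, assuming Spark 2.x or later; the object name, the demo helper, and the sample row are invented for the illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ShowTruncation {
  def demo(df: DataFrame): Unit = {
    df.show(35)         // up to 35 rows; long strings are cut off and end in "..."
    df.show(35, false)  // up to 35 rows; full cell contents
    df.show(false)      // default 20 rows; full cell contents
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("show-truncation")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical row with a string longer than 20 characters.
    val df = Seq((1, "a value that is clearly longer than twenty characters"))
      .toDF("id", "text")
    demo(df)

    spark.stop()
  }
}
```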

Talha Tayyab
Suresh G
2

As others suggested, printing out an entire DF is a bad idea. However, you can use df.rdd.foreachPartition(f) to print it partition by partition without flooding the driver JVM (as collect would).
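
A rough sketch of what that could look like, assuming Spark 2.x or later; the object and helper names are hypothetical. One caveat matters here: the println calls run on the executors, so on a cluster the output lands in the executors' stdout logs rather than in the driver or shell session. With a local master, as below, executors share the driver's JVM, so the rows do show up in the same console.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionPrint {
  // Prints each partition where it lives; nothing is sent back to the driver.
  // On a real cluster this output goes to executor stdout, not the driver.
  def printOnExecutors(df: DataFrame): Unit =
    df.rdd.foreachPartition { rows =>
      rows.foreach(println)
    }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("partition-print")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for df.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")
    printOnExecutors(df)

    spark.stop()
  }
}
```

Row order across partitions is not guaranteed, since partitions are processed in parallel.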

David Arenburg
ayan guha
  • Can you please provide some sample code? Won't the print statements inside the `f()` function print to the workers' stdout, and not the driver/your shell session's stdout? See also http://stackoverflow.com/a/28804763/215945 – Mark Rajcok Sep 27 '16 at 20:18
1

There is nothing more succinct than that, but if you want to avoid Int.MaxValue, you could collect the rows and process them yourself, or use foreach. For a tabular format without much manual code, though, show is the best you can do.
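
For the "collect and process it" route, a bare-bones sketch might look like the following, assuming Spark 2.x or later and a dataset small enough to fit on the driver. The toTsvLines helper is a made-up name; it builds tab-separated lines with a header and makes no attempt to align columns the way show() does.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CollectAndPrint {
  // Collects all rows to the driver and renders them as tab-separated lines,
  // with the column names as the first line.
  def toTsvLines(df: DataFrame): Seq[String] = {
    val header = df.columns.mkString("\t")
    val rows = df.collect().map(_.mkString("\t"))
    header +: rows.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("collect-and-print")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the real DataFrame.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
    toTsvLines(df).foreach(println)

    spark.stop()
  }
}
```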

Justin Pihony
0

In Java I have tried two ways. Both work for me:

1.

data.show(SomeNo);

2.

data.foreach(new ForeachFunction<Row>() {
    public void call(Row arg0) throws Exception {
        System.out.println(arg0);
    }
});
maRtin
Rajeev Rathor
-3

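I've tried show(), and it seems to work sometimes but not always. Just give it a try:

println(df.show())

Note that show() already prints the table itself and returns Unit, so the surrounding println only adds a trailing "()" line.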
keypoint