
Using PySpark in a Jupyter notebook, the output of Spark's `DataFrame.show` is low-tech compared to how Pandas DataFrames are displayed. I thought "Well, it does the job," until I got this:

[screenshot: df.show() output with lines wrapping across the notebook width]

The output is not adjusted to the width of the notebook, so the lines wrap in an ugly way. Is there a way to customize this? Even better, is there a way to get Pandas-style output (without converting to a pandas.DataFrame, obviously)?

clstaudt
  • You could just convert the first 5 rows to a Pandas df – mtoto May 25 '18 at 07:57
  • `df.limit(5).toPandas()` – phi May 25 '18 at 08:06
  • Two workarounds: you could try to widen your Jupyter Notebook cells, as in the accepted answer at https://stackoverflow.com/questions/21971449/how-do-i-increase-the-cell-width-of-the-jupyter-ipython-notebook-in-my-browser, or use `df.show(vertical=True)`, as seen in `def show(self, n=20, truncate=True, vertical=False)` in the source code https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py – titiro89 May 25 '18 at 13:03
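
A minimal sketch of the workaround suggested in the comments above, assuming an existing DataFrame `df`:

# Convert only the first few rows to Pandas so the notebook renders
# them with Pandas' HTML styling; limit() keeps the conversion cheap.
df.limit(5).toPandas()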

7 Answers


This is now possible natively as of Spark 2.4.0 by setting `spark.sql.repl.eagerEval.enabled` to `True`:

[screenshot: the DataFrame rendered as an HTML table in the notebook]
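
A minimal sketch of enabling this, assuming an existing SparkSession `spark` and a DataFrame `df`:

# Enable eager evaluation (Spark >= 2.4.0): a DataFrame left as the
# last expression in a notebook cell is then rendered as an HTML table.
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

df.limit(5)  # evaluate in a cell by itself, instead of df.show()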

Kyle Barron
  • This does not appear to work for me on my own dataset, which has a lot of columns: `spark.conf.set("spark.sql.repl.eagerEval.enabled", True)` followed by `df.limit(10)` – Reddspark Apr 02 '19 at 22:22
  • This would be good if it worked, which it does not on `2.4.3`, apparently. – ijoseph May 06 '20 at 22:53
  • This will load the entire dataset into your driver, which may not be desired. – Luis Meraz Aug 18 '20 at 19:33
  • You may also configure this during session creation: `spark = SparkSession.builder.config("spark.sql.repl.eagerEval.enabled", True).getOrCreate()` – Kim Jun 14 '22 at 07:52

After playing around with my table, which has a lot of columns, I decided the best way to get a feel for the data is to use:

df.show(n=5, truncate=False, vertical=True)

This displays the data vertically without truncation and is the cleanest view I have come up with.
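
For illustration, vertical mode prints each row as its own record block, roughly like this (the `id` and `name` columns are hypothetical):

df.show(n=2, truncate=False, vertical=True)
# -RECORD 0--------
#  id   | 1
#  name | Alice
# -RECORD 1--------
#  id   | 2
#  name | Bob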

Reddspark

You can use an HTML magic command. Check that the CSS selector is correct by inspecting an output cell, then edit the snippet below accordingly and run it in a cell:

%%html
<style>
div.output_area pre {
    white-space: pre;
}
</style>
Luis Meraz

Adding to the answers given by @karan-singla and @vijay-jangir in "pyspark show dataframe as table with horizontal scroll in ipython notebook", here is a handy one-liner to comment out the `white-space: pre-wrap` styling:

$ awk -i inplace '/pre-wrap/ {$0="/*"$0"*/"}1' $(dirname `python -c "import notebook as nb;print(nb.__file__)"`)/static/style/style.min.css

This translates as: use awk to update, in place, lines that contain `pre-wrap` so they become surrounded by `/* ... */`, i.e. commented out, in the `style.min.css` file found in your working Python environment.

This, in theory, can then be used as an alias if one uses multiple environments, say with Anaconda.


tallamjr

I can only attest to VS Code's Jupyter output, but the default behavior garbles/"word-wraps" Spark DataFrames the same way. At least in VS Code, you can edit the notebook's default CSS using the HTML() class from IPython.core.display.

This snippet overrides the default Jupyter cell output style to prevent the word-wrap behavior for Spark DataFrames.

Just run it in a cell (in VS Code, it hot-fixes the issue even if the output is already displayed).

from IPython.core.display import HTML

HTML("""<style>
    .output-plaintext, .output-stream, .output{
        white-space: pre !important;
        font-family: Monaco; /* any monospaced font should work */
    }</style>""")
Jung Hoon Son

If you are using SparkSession.builder, I recommend setting the option `spark.sql.repl.eagerEval.enabled` to `True`:

spark = SparkSession.builder.config("spark.sql.repl.eagerEval.enabled", True).getOrCreate()

Afterwards, to show the formatted table:

df = spark.sql("select * from my_table")
df

You have to display the DataFrame by evaluating `df` itself, not by calling `df.show()`.

This works with PySpark 2.4.0.

johnnyheineken

Just try this:

df.show(truncate=False)
Talha Tayyab
  • Isn't this effectively the same answer as [the top-voted answer from three years ago](https://stackoverflow.com/a/55484426/3025856), but with less explanation? The previous answer uses a couple of additional arguments, but the core guidance is the same. – Jeremy Caney Oct 11 '21 at 19:43