
Using PySpark in a Jupyter notebook, the output of Spark's `DataFrame.show` is low-tech compared to how Pandas DataFrames are displayed. I thought "Well, it does the job," until I got this:

[screenshot: df.show() output with lines wrapping across the notebook width]

The output is not adjusted to the width of the notebook, so the lines wrap in an ugly way. Is there a way to customize this? Even better, is there a way to get Pandas-style output (without converting to a pandas.DataFrame, obviously)?

clstaudt
  • You could just convert the first 5 rows to a Pandas df – mtoto May 25 '18 at 07:57
  • `df.limit(5).toPandas()` – phi May 25 '18 at 08:06
  • Two workarounds: you could try to widen your Jupyter Notebook cells, as in the accepted answer at https://stackoverflow.com/questions/21971449/how-do-i-increase-the-cell-width-of-the-jupyter-ipython-notebook-in-my-browser, or use `df.show(vertical=True)`, as seen in `def show(self, n=20, truncate=True, vertical=False)` in the source code https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py – titiro89 May 25 '18 at 13:03
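
A minimal sketch of the workaround suggested in the comments above, assuming an existing DataFrame `df`:

# Convert only the first few rows to Pandas so the notebook renders
# them with Pandas' HTML styling; limit() keeps the conversion cheap.
df.limit(5).toPandas()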

7 Answers


This is now possible natively as of Spark 2.4.0 by setting `spark.sql.repl.eagerEval.enabled` to `True`:

[screenshot: the DataFrame rendered as an HTML table in the notebook]
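
A minimal sketch of enabling this, assuming an existing SparkSession `spark` and a DataFrame `df`:

# Enable eager evaluation (Spark >= 2.4.0): a DataFrame left as the
# last expression in a notebook cell is then rendered as an HTML table.
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

df.limit(5)  # evaluate in a cell by itself, instead of df.show()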

Kyle Barron
  • This does not appear to work for me on my own dataset, which has a lot of columns: `spark.conf.set("spark.sql.repl.eagerEval.enabled", True)` followed by `df.limit(10)` – Reddspark Apr 02 '19 at 22:22
  • This would be good if it worked, which it does not on `2.4.3`, apparently. – ijoseph May 06 '20 at 22:53
  • This will load the entire dataset into your driver, which may not be desired. – Luis Meraz Aug 18 '20 at 19:33
  • You may also configure this during session creation: `spark = SparkSession.builder.config("spark.sql.repl.eagerEval.enabled", True).getOrCreate()` – Kim Jun 14 '22 at 07:52

After playing around with my table, which has a lot of columns, I decided the best way to get a feel for the data is to use:

df.show(n=5, truncate=False, vertical=True)

This displays the data vertically without truncation and is the cleanest view I have come up with.
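
For illustration, vertical mode prints each row as its own record block, roughly like this (the `id` and `name` columns are hypothetical):

df.show(n=2, truncate=False, vertical=True)
# -RECORD 0--------
#  id   | 1
#  name | Alice
# -RECORD 1--------
#  id   | 2
#  name | Bob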

Reddspark

You can use an HTML magic command. Check that the CSS selector is correct by inspecting an output cell, then edit the snippet below accordingly and run it in a cell:

%%html
<style>
div.output_area pre {
    white-space: pre;
}
</style>
Luis Meraz

Adding to the answers given by @karan-singla and @vijay-jangir in "pyspark show dataframe as table with horizontal scroll in ipython notebook", here is a handy one-liner to comment out the `white-space: pre-wrap` styling:

$ awk -i inplace '/pre-wrap/ {$0="/*"$0"*/"}1' $(dirname `python -c "import notebook as nb;print(nb.__file__)"`)/static/style/style.min.css

This translates as: use awk to update, in place, lines that contain `pre-wrap` so they become surrounded by `/* ... */`, i.e. commented out, in the `style.min.css` file found in your working Python environment.

This, in theory, can then be used as an alias if one uses multiple environments, say with Anaconda.


tallamjr

I can only attest to VS Code's Jupyter output, but the default behavior garbles/"word-wraps" Spark DataFrames the same way. At least in VS Code, you can edit the notebook's default CSS using the HTML() class from IPython.core.display.

This snippet overrides the default Jupyter cell output style to prevent the word-wrap behavior for Spark DataFrames.

Just run it in a cell (in VS Code, it hot-fixes the issue even if the output is already displayed).

from IPython.core.display import HTML

HTML("""<style>
    .output-plaintext, .output-stream, .output{
        white-space: pre !important;
        font-family: Monaco; /* any monospaced font should work */
    }</style>""")
Jung Hoon Son

If you are using SparkSession.builder, I recommend setting the option `spark.sql.repl.eagerEval.enabled` to `True`:

spark = SparkSession.builder.config("spark.sql.repl.eagerEval.enabled", True).getOrCreate()

Afterwards, to show the formatted table:

df = spark.sql("select * from my_table")
df

You have to display the DataFrame by evaluating `df` itself, not by calling `df.show()`.

This works with PySpark 2.4.0.

johnnyheineken

Just try this:

df.show(truncate=False)
Talha Tayyab
  • Isn't this effectively the same answer as [the top-voted answer from three years ago](https://stackoverflow.com/a/55484426/3025856), but with less explanation? The previous answer uses a couple of additional arguments, but the core guidance is the same. – Jeremy Caney Oct 11 '21 at 19:43