
I would like to have the same option as mentioned in this question: How to display full output in Jupyter, not only last result? but for the PySpark kernel of AWS EMR's JupyterHub (Spark 2.4.4). It already works with the python3 (Python 3.6) kernel.

It works if I use print statements, but then, if the last step fails, only the output of the failed step is shown, as in the image below.

[Image: comparison between the pyspark and python kernels on the same JupyterHub]

Also, not sure if it is related, but the code below doesn't run in sync (print, wait, print, wait, ...); instead, it prints everything at once at the end.

import time
for i in range(0,10):
    print(i)
    time.sleep(2)

I'm also including the question from the referred post here, in case that question/post gets deleted or changed.

I want Jupyter to print all the interactive output, not only the last result, without resorting to print. How can I do that?

Example:

a=3
a
a+1

I would like to display

3
4
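
For context, the setting that achieves this in a regular IPython kernel (and which, I assume, is what makes it work in the python3 kernel) is:

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"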

1 Answer


The output of a print statement goes to the stdout or stderr of the machine that is running the Spark executor.

Consider a big cluster with n workers, each storing a partition of an RDD or DataFrame. It is hard to expect ordered output from a job (for instance, a map). This can also be seen as a design choice in Spark itself: where would that data be printed? Since the nodes run code in parallel, which of them should print first?

So we don't have interactive print statements inside jobs. This whole situation is also a reminder of why we have accumulators and broadcast variables.
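
As a rough illustration (assuming an active SparkSession named spark, which the PySpark kernel normally provides), the print inside the mapped function runs on the executors, so its output ends up in the executor logs rather than in the notebook cell; only the collected result comes back to the driver:

rdd = spark.sparkContext.parallelize(range(4), numSlices=2)

def show(x):
    print(f"processing {x}")  # written to the executor's stdout, not the notebook
    return x * 2

print(rdd.map(show).collect())  # only this result is returned to the driver/notebook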

So I would advise you to use the logs generated by the steps and work with those instead. To view logs in Amazon S3, cluster logging must be enabled (which is the default for new clusters); see View Log Files Archived to Amazon S3.
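
As a rough sketch of where to look (bucket, prefix, and IDs below are placeholders), the executor stdout produced by such print calls is typically archived under a path like:

s3://<log-bucket>/<prefix>/<cluster-id>/containers/<application-id>/<container-id>/stdout.gz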

For your second question about sleep() and print: Python's stdout is line-buffered when attached to a terminal, so it waits for a newline before writing. If the output is not a console, the stream is block-buffered and even a newline won't trigger a flush.

You can force a flush like this:

import time
for i in range(0,10):
    print(i, flush=True)
    time.sleep(2)
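
Equivalently (a minor variation on the same idea), you can write to sys.stdout and flush it explicitly:

import sys
import time

for i in range(10):
    sys.stdout.write(f"{i}\n")
    sys.stdout.flush()  # push the output out immediately
    time.sleep(2)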
  • How did flush work for you in the pyspark kernel? Did it print one line at a time or all at once? For me it printed everything at once. For the first part, I understand logs are our option; we use them for prod, but it becomes tedious for dev, especially for troubleshooting, and sometimes you want to see data, which logs aren't really the right tool for. I know Hue does print it properly, the way we need, but its GUI is not as nice as Jupyter, plus it keeps refreshing while we are busy typing and often loses the changes. P.S. I will wait some more time for other responses; otherwise I will accept this answer. – Lavesh Aug 30 '20 at 10:19