
I am running a production job in Databricks on a job cluster. During environment initialization I created a notebook that contains a lot of print statements, which causes the job cluster to exceed the output size limit and the job to fail.

I have tried to configure this parameter

spark.databricks.driver.disableScalaOutput true

The above parameter does not seem to work. Is there any other way to tackle this issue?


1 Answer


a notebook that contains a lot of print statements, which causes the job cluster to exceed the output size limit and the job to fail.

Based on the given information, I tried to reproduce the issue by writing 999 print statements per notebook across 3 notebooks, i.e. 2997 print statements in total, and it works fine for me.
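To give a sense of scale, here is roughly what the repro looks like (a minimal sketch, assuming it runs as a Databricks notebook cell; the exact content of each printed line does not matter):

    # Repro sketch: 999 print statements per notebook, repeated across
    # 3 notebooks (2997 lines of stdout in total). This stays well below
    # the 20 MB notebook-output limit, so the job completes normally.
    for i in range(999):
        print(f"print statement number {i}")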


The possible reasons for the error could be:

  • If you are using multiple display(), displayHTML(), or show() commands in your notebook, this increases the amount of output. Once the output exceeds 20 MB, the error occurs.

  • If you are using multiple print() commands in your notebook, this can increase the output to stdout. Once the output exceeds 20 MB, the error occurs.

  • If you are running a streaming job and enable awaitAnyTermination in the cluster’s Spark config, it tries to fetch the entire output in a single request. If this exceeds 20 MB, the error occurs.

So, make sure the total output doesn’t exceed 20 MB.

Solution:

Remove any unnecessary display(), displayHTML(), print(), and show() commands from your notebook. These can be useful for debugging, but they are not recommended for production jobs.
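If you still want some of that output while debugging interactively, one simple option is to gate it behind a flag so that production runs emit nothing to stdout (a sketch; the DEBUG flag and debug_print helper are just illustrative names, not Databricks features):

    # Gate debug output behind a flag so job runs stay quiet on stdout.
    DEBUG = False  # flip to True only when debugging interactively

    def debug_print(*args):
        if DEBUG:
            print(*args)

    debug_print("suppressed when DEBUG is False")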

If your job output is exceeding the 20 MB limit, try redirecting your logs to log4j or disabling stdout by setting spark.databricks.driver.disableScalaOutput true in the cluster’s Spark config.
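For the log4j route, a minimal sketch in a Python notebook looks like this (it assumes a Databricks notebook where the SparkContext is exposed as sc; the logger name "env_init" is arbitrary). Messages sent this way land in the driver's log4j logs rather than in the notebook output, so they do not count toward the 20 MB limit:

    # Write messages to the driver's log4j logs instead of stdout.
    log4j = sc._jvm.org.apache.log4j
    logger = log4j.LogManager.getLogger("env_init")
    logger.info("environment initialization started")
    logger.warn("this goes to the cluster driver logs, not to the notebook output")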

Utkarsh Pal