By default on Dataproc image version 1.5+, `yarn:yarn.log-aggregation-enable` is set to `true` and `yarn:yarn.nodemanager.remote-app-log-dir` is set to `gs://<cluster-tmp-bucket>/<cluster-uuid>/yarn-logs`, so YARN container logs are aggregated into that GCS directory. You can point aggregation at a different directory when creating the cluster:

    gcloud dataproc clusters create ... \
        --properties yarn:yarn.nodemanager.remote-app-log-dir=<gcs-dir>

or change the cluster's tmp bucket instead:

    gcloud dataproc clusters create ... --temp-bucket <bucket>
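To actually locate the aggregated logs for an existing cluster, you can look up its tmp bucket and UUID and then list the directory. A sketch, assuming a hypothetical cluster name and region (requires `gcloud` auth and an existing cluster):

```shell
# Hypothetical names; substitute your own cluster and region.
CLUSTER=my-cluster
REGION=us-central1

# Look up the cluster's tmp bucket and UUID.
TMP_BUCKET=$(gcloud dataproc clusters describe "$CLUSTER" --region="$REGION" \
    --format="value(config.tempBucket)")
UUID=$(gcloud dataproc clusters describe "$CLUSTER" --region="$REGION" \
    --format="value(clusterUuid)")

# List the aggregated YARN container logs.
gsutil ls -r "gs://${TMP_BUCKET}/${UUID}/yarn-logs/"
```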
Note that if your Spark job runs in client mode (the default), the Spark driver runs on the master node rather than in a YARN container, so its logs are not aggregated by YARN. Instead, driver output is stored at the location recorded in the Dataproc-generated job property `driverOutputResourceUri`, which points to a job-specific folder in the cluster's staging bucket. In cluster mode, the Spark driver runs inside YARN, so its logs are YARN container logs and are aggregated as described above.
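For client-mode jobs, you can read the driver output back from GCS by resolving that property first. A sketch, assuming a hypothetical job ID and region (requires `gcloud` auth):

```shell
# Hypothetical job ID and region; substitute your own.
JOB=my-job-id
REGION=us-central1

# Find the GCS prefix where Dataproc stored the driver output.
URI=$(gcloud dataproc jobs describe "$JOB" --region="$REGION" \
    --format="value(driverOutputResourceUri)")

# The output is written as one or more files under that prefix.
gsutil cat "${URI}"*
```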
If you want to disable Cloud Logging for your cluster, set `dataproc:dataproc.logging.stackdriver.enable=false` at cluster creation. Note that this disables all Cloud Logging logs for the cluster, including YARN container logs, startup logs, and service logs.
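For completeness, that property is also set with the same `--properties` flag at cluster creation (hypothetical cluster name and region):

```shell
# Hypothetical names; substitute your own cluster and region.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties=dataproc:dataproc.logging.stackdriver.enable=false
```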