
I have the following PySpark script, sample.py, which contains print statements:

import sys
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as f
from datetime import datetime
from time import time

if __name__ == '__main__':
    spark = SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()
    print("Print statement-1")
    schema = StructType([
        StructField("author", StringType(), False),
        StructField("title", StringType(), False),
        StructField("pages", IntegerType(), False),
        StructField("email", StringType(), False)
    ])

    data = [
        ["author1", "title1", 1, "author1@gmail.com"],
        ["author2", "title2", 2, "author2@gmail.com"],
        ["author3", "title3", 3, "author3@gmail.com"],
        ["author4", "title4", 4, "author4@gmail.com"]
    ]

    df = spark.createDataFrame(data, schema)
    print("Number of records",df.count())
    sys.exit(0)

The spark-submit command below does not capture the print statements in sample.log:

spark-submit --master yarn --deploy-mode cluster sample.py > sample.log

The scenario is that we want to print some information to the log file so that, after the Spark job completes, we can perform further actions based on the print statements found in the log file.

Please help me with this.

  • The print statement is printed in sample.log when --deploy-mode is client, but not when it is cluster. I understand that in client mode the print statement works because the driver is created on the client machine, whereas in cluster mode the driver runs inside the cluster. But given the problem scenario above, how can it be printed in sample.log? – Surendiran Balasubramanian Aug 04 '22 at 18:23

1 Answer


The print statements will not be found in the spark-submit output but rather in the YARN logs, because in cluster mode the driver runs inside the cluster. When you run spark-submit you will get an application ID, which looks like application_1234567890123_12345.

Once the Spark job has completed, run the following command with that application ID to fetch the aggregated YARN logs:

yarn logs -applicationId <applicationId>
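
For example, the aggregated output can then be checked for the marker printed by the driver. A minimal usage sketch, assuming log aggregation is enabled on the cluster and using the "Number of records" string printed in sample.py:

yarn logs -applicationId application_1234567890123_12345 | grep "Number of records"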
viggnah
  • Thank you for the response. When spark-submit completes, I can get the applicationId from the spark-submit output, in our case sample.log. But I have another question: when the job completes, YARN will kill all the containers, right? If the containers are killed, how are the YARN logs still available? Also, apart from the spark-submit output, is there any other way to get the applicationId in a shell script where I have the spark-submit command? – Surendiran Balasubramanian Aug 05 '22 at 18:22
  • 1. The logs will be retained for `yarn.nodemanager.log.retain-seconds` (set in `yarn-site.xml`) on the node managers' local disks. Also, when `yarn.log-aggregation-enable` is set to `true`, they are moved to a central directory specified in `yarn.nodemanager.remote-app-log-dir`. Refer to [this](https://www.davidmcginnis.net/post/a-crash-course-in-yarn-log-aggregation) for more details. 2. [This](https://stackoverflow.com/questions/34588192/how-to-get-applicationid-of-spark-application-deployed-to-yarn-in-scala) SO answer gives some options to get the app ID from a script (see the sketch below). – viggnah Aug 06 '22 at 05:41
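
To illustrate the second point, here is a minimal shell sketch of one option: parse the application ID out of the spark-submit output, then act on the aggregated logs. It assumes a grep that supports -o, the YARN CLI on the PATH, and that the launcher's application report (which Spark typically writes to stderr) is captured into sample.log:

# Capture both stdout and stderr, since Spark's launcher logs usually go to stderr.
spark-submit --master yarn --deploy-mode cluster sample.py > sample.log 2>&1

# Parse the application ID out of the spark-submit output.
appId=$(grep -o 'application_[0-9]*_[0-9]*' sample.log | head -n 1)

# Alternative (assumes the appName "Test" set in sample.py is unique):
# appId=$(yarn application -list -appStates FINISHED | awk '/Test/ {print $1; exit}')

# Fetch the aggregated logs and act on the marker printed by the driver.
yarn logs -applicationId "$appId" > yarn.log
if grep -q "Number of records" yarn.log; then
    echo "Marker found; running follow-up actions..."
fi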