I have the following PySpark code, named sample.py, with print statements:
import sys
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as f
from datetime import datetime
from time import time
if __name__ == '__main__':
    spark = SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()
    print("Print statement-1")
    schema = StructType([
        StructField("author", StringType(), False),
        StructField("title", StringType(), False),
        StructField("pages", IntegerType(), False),
        StructField("email", StringType(), False)
    ])
    data = [
        ["author1", "title1", 1, "author1@gmail.com"],
        ["author2", "title2", 2, "author2@gmail.com"],
        ["author3", "title3", 3, "author3@gmail.com"],
        ["author4", "title4", 4, "author4@gmail.com"]
    ]
    df = spark.createDataFrame(data, schema)
    print("Number of records", df.count())
    sys.exit(0)
The below spark-submit command, whose output is redirected to sample.log, is not capturing the print statements:
spark-submit --master yarn --deploy-mode cluster sample.py > sample.log
The scenario is that we want to print some information to the log file so that, after the Spark job completes, we can check for those print statements in the log file and take further actions based on them (roughly as in the sketch below).
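To illustrate the intent, here is a rough sketch of the follow-up step we have in mind. It only assumes the sample.log file and the "Number of records" marker printed by sample.py above; check_log.py and take_further_action are hypothetical placeholders for whatever we would actually run next.

# check_log.py - hypothetical post-processing step, run after spark-submit finishes
def take_further_action(count):
    # placeholder for the real follow-up logic
    print("Would act on record count:", count)

with open("sample.log") as log_file:
    for line in log_file:
        # look for the marker line printed by sample.py
        if line.startswith("Number of records"):
            count = int(line.rsplit(" ", 1)[-1])
            take_further_action(count)
            break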
Please help me with this.