
For example, I have a dataframe that needs some processing and type conversions on its columns, and I keep overwriting the existing dataframe again and again, as in the code below:

var fd = spark.read
  .format("csv")
  .option("inferSchema", "false")
  .option("header", "true")
  .load(csvFile)

fd = fd.withColumn("date", col("date").cast("String"))

I am new to Spark, so I don't know a better approach to this kind of operation.

Any suggestions?
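One common way to avoid the `var` reassignment pattern is to chain the transformations into a single immutable `val`. A minimal sketch, assuming the same `spark` session and `csvFile` path as in the question (the second column cast on `amount` is hypothetical, just to show how further conversions would chain):

```scala
import org.apache.spark.sql.functions.col

// Build the whole pipeline as one expression; no reassignment needed.
val fd = spark.read
  .format("csv")
  .option("inferSchema", "false")
  .option("header", "true")
  .load(csvFile)
  .withColumn("date", col("date").cast("string"))     // same cast as in the question
  .withColumn("amount", col("amount").cast("double")) // hypothetical extra conversion
```

Since DataFrame transformations are lazy and each call returns a new DataFrame, chaining them costs nothing extra at runtime; it just removes the mutable variable.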

Zeeshan
    I doubt there is a difference. In either case, unused objects will be garbage collected. But either way, you should never ever ever use `var` in Spark. In many cases it will break your application before you even start worrying about memory. There are a handful of cases where it won't impact functionality but it's just easier to never use it than to make that distinction – sinanspd Dec 22 '21 at 06:53
  • Thank you for your answer, appreciate it :) – Zeeshan Dec 22 '21 at 07:02

0 Answers