4

How can I overwrite an RDD's output at an existing path when saving?

test1:

975078|56691|2.000|20171001_926_570_1322
975078|42993|1.690|20171001_926_570_1322
975078|46462|2.000|20171001_926_570_1322
975078|87815|1.000|20171001_926_570_1322

from pyspark.sql import Row

rdd = sc.textFile('/home/administrator/work/test1') \
        .map(lambda x: x.split("|")[:4]) \
        .map(lambda r: Row(user_code=r[0], item_code=r[1], qty=float(r[2])))
rdd.coalesce(1).saveAsPickleFile("/home/administrator/work/foobar_seq1")
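For reference, the per-line parsing step above can be sketched outside Spark (a plain dict stands in for pyspark.sql.Row here):

```python
def parse(line):
    # Keep the first three of the four "|"-separated fields and name them.
    user_code, item_code, qty, _ = line.split("|")[:4]
    return {"user_code": user_code, "item_code": item_code, "qty": float(qty)}

print(parse("975078|56691|2.000|20171001_926_570_1322"))
# → {'user_code': '975078', 'item_code': '56691', 'qty': 2.0}
```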

The first time, it saves properly. Then I removed one line from the input file and tried to save the RDD to the same location, and it reports that the output path already exists:

rdd.coalesce(1).saveAsPickleFile("/home/administrator/work/foobar_seq1") 

With a DataFrame, for example, we can overwrite an existing path:

df.coalesce(1).write.mode('overwrite').save(path)

If I try the same on the RDD object, I get an error:

rdd.coalesce(1).write().overwrite().saveAsPickleFile(path)

Please help me with this.

himanshuIIITian
Sai
  • `RDD` doesn't have a write method. Please provide a [mcve]. – philantrovert Mar 28 '18 at 07:54
  • Thanks for the quick reply, I have updated my question. Yes, you're right, RDD doesn't have a write method. Is there any method on RDD that is equivalent to the write method? – Sai Mar 28 '18 at 08:12
  • Possible duplicate of [How to overwrite the output directory in spark](https://stackoverflow.com/questions/27033823/how-to-overwrite-the-output-directory-in-spark) – philantrovert Mar 28 '18 at 09:19

2 Answers

1

Hi, you can save RDD output as shown below. Note: the code is in Scala, but the logic is the same for Python. I am using Spark 2.3.0.

  import org.apache.spark.{SparkConf, SparkContext}

  val sconf = new SparkConf()
    .set("spark.hadoop.validateOutputSpecs", "false") // skip the existing-output check so saves can overwrite
    .setMaster("local[*]")
    .setAppName("test")
  val scontext = new SparkContext(sconf)
  val lines = scontext.textFile(s"${filePath}", 1)
  println(lines.first)
  lines.saveAsTextFile("C:\\Users\\...\\Desktop\\sample2")
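If you'd rather not disable output-spec validation globally, another common workaround is to delete the existing output directory before saving. A minimal sketch in Python (the plain file write here is a stand-in for rdd.coalesce(1).saveAsPickleFile(path); the paths and helper names are hypothetical):

```python
import os
import shutil
import tempfile

def save_overwriting(path, save_fn):
    # Spark refuses to write into an existing directory, so remove it first.
    if os.path.isdir(path):
        shutil.rmtree(path)
    save_fn(path)

def fake_save(path):
    # Stand-in for rdd.coalesce(1).saveAsPickleFile(path).
    os.makedirs(path)
    with open(os.path.join(path, "part-00000"), "w") as f:
        f.write("data")

out = os.path.join(tempfile.mkdtemp(), "foobar_seq1")
save_overwriting(out, fake_save)  # first save: directory is created
save_overwriting(out, fake_save)  # second save: old directory is replaced
```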

Or, if you are working with a DataFrame, then use:

df.write.mode(SaveMode.Overwrite).parquet(path)


Rakshith
Rajnish Kumar
0

RDD has no write mode, but you can convert the RDD to a DataFrame and use the DataFrame's overwrite mode. As follows:

rdd.toDF().coalesce(1).write.csv(path=yourpath, mode='overwrite')
damon