
This question has a partial answer here: Write to multiple outputs by key Spark - one Spark job

But I want to save a DataFrame to multiple CSV files.

from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(name=u'name1', website=u'http://1', url=u'1'),
                                 Row(name=u'name2', website=u'http://1', url=u'1'),
                                 Row(name=u'name3', website=u'https://fsadf', url=u'2'),
                                 Row(name=u'name4', website=None, url=u'3')])

df.write.format('com.databricks.spark.csv').partitionBy("name").save("dataset.csv")

I'm using spark-csv (https://github.com/databricks/spark-csv) to handle the CSV data.

One more thing: df.write.partitionBy("column").json("dataset") saves the data to multiple directories such as column=value1, column=value2, etc., but the partition column itself is not present in the written data.

What if I need that column in the output dataset?

  • _What if I need that column in the output dataset?_ - then you'll have to either create partitions manually (without `partitionBy`) or do a second sweep. – zero323 Aug 05 '16 at 12:04
  • @zero323, I'm ready to create partitions manually or do a second sweep, but I'm not sure how to do that. Could you please help me? – frank Aug 09 '16 at 07:56
  • Option a) after saving, load the data directory by directory and overwrite each with the column added back; b) collect the set of unique values and perform a separate filter and write for each (see the first sketch after this thread). It makes sense to `repartition` the data by the column of interest first. – zero323 Aug 09 '16 at 11:12
  • Thanks for the answer. – frank Aug 09 '16 at 11:42
  • There is also another possible trick: just duplicate the partitioning column, `df.withColumn("key", col("name")).write.partitionBy("key")` (see the second sketch below). – zero323 Aug 09 '16 at 11:46
  • Thanks, the last one did the trick. – frank Aug 10 '16 at 03:52
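
For reference, a minimal sketch of option b) from the comments, reusing `df` and `sqlContext` from the question. The output paths, the `header` option, and the `repartition` step are illustrative assumptions, not part of the original question:

# Option b): collect the distinct keys, then filter and write one CSV
# directory per key, so the "name" column stays inside the written files.

# Optional (Spark 1.6+): repartition by the key first so each filter is cheaper.
df = df.repartition('name')

keys = [r['name'] for r in df.select('name').distinct().collect()]

for key in keys:
    (df.filter(df['name'] == key)
       .write
       .format('com.databricks.spark.csv')
       .option('header', 'true')
       .save('dataset/name=%s' % key))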
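
And a minimal sketch of the duplicated-column trick from the last comment, again reusing `df` from the question; the `header` option and the output path are assumptions for illustration:

from pyspark.sql.functions import col

# Partition by a copy of the column, so the original "name" column is still
# written inside each CSV; directories come out as dataset/key=name1, etc.
(df.withColumn('key', col('name'))
   .write
   .format('com.databricks.spark.csv')
   .option('header', 'true')
   .partitionBy('key')
   .save('dataset'))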

0 Answers