
I'm wondering if there's a way to combine the final result into a single file when using Spark. Here's the code I have:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("logs").setMaster("local[*]")
sc = SparkContext(conf=conf)

logs_1 = sc.textFile('logs/logs_1.tsv')
logs_2 = sc.textFile('logs/logs_2.tsv')

urls_1 = logs_1.map(lambda line: line.split("\t")[2])
urls_2 = logs_2.map(lambda line: line.split("\t")[2])

all_urls = urls_1.intersection(urls_2)
all_urls = all_urls.filter(lambda url: url != "localhost")

all_urls.collect()

all_urls.saveAsTextFile('logs.csv')

The collect() method doesn't seem to be working (or I've misunderstood its purpose). Essentially, I need the 'saveAsTextFile' to output to a single file, instead of a folder with parts.

Reza Karami

3 Answers


Well, before you save, you can repartition once, like below:

all_urls.repartition(1).saveAsTextFile(resultPath)

Then you will get just one result file.
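
Note that saveAsTextFile still creates a directory; repartition(1) only guarantees a single part-00000 file inside it. If a genuinely single file is needed in local mode, one option is to move that part file out afterwards. A minimal sketch, with a hypothetical 'logs_out' output path:

import glob, shutil

all_urls.repartition(1).saveAsTextFile('logs_out')
# 'logs_out' is a directory; pull out the single part file it contains
part_file = glob.glob('logs_out/part-*')[0]
shutil.move(part_file, 'logs.csv')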

Neenad

Please find below some suggestions:

  • collect() and saveAsTextFile() are both actions: collect() brings the results back to the driver node, while saveAsTextFile() writes them out to storage. It is therefore redundant to call both of them here.

  • In your case you just need to store the data with saveAsTextFile(); there is no need to call collect().

  • collect() returns a list of items (and in your case you are not even using the returned value).

  • As Glennie and Akash suggested, just use coalesce(1) to force a single partition. coalesce(1) avoids a full shuffle and is therefore usually more efficient than repartition(1); see the sketch after this list.

  • In the given code you are using Spark's RDD API; I would suggest using DataFrames/Datasets instead (see the DataFrame sketch below the links).
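
As a minimal sketch of the coalesce(1) suggestion applied to the question's RDD (the 'logs_out' path is hypothetical; the result is still a directory containing a single part file):

# force a single partition without a full shuffle, then write one part file
all_urls.coalesce(1).saveAsTextFile('logs_out')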

Please refer to the following links for further details on RDDs and DataFrames:

Difference between DataFrame, Dataset, and RDD in Spark

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
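
For the DataFrame suggestion, here is a minimal sketch, assuming the same TSV layout as in the question (the URL in the third column) and hypothetical paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logs").master("local[*]").getOrCreate()

# spark.read.csv names the columns _c0, _c1, _c2, ...
urls_1 = spark.read.csv('logs/logs_1.tsv', sep='\t').select('_c2')
urls_2 = spark.read.csv('logs/logs_2.tsv', sep='\t').select('_c2')

all_urls = urls_1.intersect(urls_2).filter("_c2 != 'localhost'")

# one partition, so the output directory contains a single part file
all_urls.coalesce(1).write.csv('logs_out')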

abiratsis

You can store it in Parquet format, which is well suited to HDFS. Note that write belongs to the DataFrame API, so the RDD first has to be converted to a DataFrame (assuming an active SparkSession), for example:

# convert the RDD of URLs into a single-column DataFrame, then write Parquet
all_urls.map(lambda url: (url,)).toDF(["url"]).write.parquet("dir_name")
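
Note that write.parquet also produces a directory of part files, so if a single output file is required, coalescing to one partition before writing applies here as well.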
Shrey