
I'm wondering if there's a way to combine the final result into a single file when using Spark. Here's the code I have:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("logs").setMaster("local[*]")
sc = SparkContext(conf=conf)

logs_1 = sc.textFile('logs/logs_1.tsv')
logs_2 = sc.textFile('logs/logs_2.tsv')

urls_1 = logs_1.map(lambda line: line.split("\t")[2])
urls_2 = logs_2.map(lambda line: line.split("\t")[2])

all_urls = urls_1.intersection(urls_2)
all_urls = all_urls.filter(lambda url: url != "localhost")

all_urls.collect()

all_urls.saveAsTextFile('logs.csv')

The collect() method doesn't seem to be working (or I've misunderstood its purpose). Essentially, I need the 'saveAsTextFile' to output to a single file, instead of a folder with parts.

Reza Karami

3 Answers


Well, before you save, you can repartition once, like below:

all_urls.repartition(1).saveAsTextFile(resultPath)

Then you will get just one result file.
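
Note that saveAsTextFile still creates a directory; repartition(1) only guarantees a single part-00000 file inside it. If a genuinely single file is needed in local mode, one option is to move that part file out afterwards. A minimal sketch, with a hypothetical 'logs_out' output path:

import glob, shutil

all_urls.repartition(1).saveAsTextFile('logs_out')
# 'logs_out' is a directory; pull out the single part file it contains
part_file = glob.glob('logs_out/part-*')[0]
shutil.move(part_file, 'logs.csv')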

Neenad

Please find below some suggestions:

  • collect() and saveAsTextFile() are both actions: collect() brings the results back to the driver node, while saveAsTextFile() writes them out to storage. It is therefore redundant to call both of them here.

  • In your case you just need to store the data with saveAsTextFile(); there is no need to call collect().

  • collect() returns a list of items (and in your case you are not even using the returned value).

  • As Glennie and Akash suggested, just use coalesce(1) to force a single partition. coalesce(1) avoids a full shuffle and is therefore usually more efficient than repartition(1); see the sketch after this list.

  • In the given code you are using Spark's RDD API; I would suggest using DataFrames/Datasets instead (see the DataFrame sketch below the links).
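
As a minimal sketch of the coalesce(1) suggestion applied to the question's RDD (the 'logs_out' path is hypothetical; the result is still a directory containing a single part file):

# force a single partition without a full shuffle, then write one part file
all_urls.coalesce(1).saveAsTextFile('logs_out')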

Please refer to the following links for further details on RDDs and DataFrames:

Difference between DataFrame, Dataset, and RDD in Spark

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
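
For the DataFrame suggestion, here is a minimal sketch, assuming the same TSV layout as in the question (the URL in the third column) and hypothetical paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logs").master("local[*]").getOrCreate()

# spark.read.csv names the columns _c0, _c1, _c2, ...
urls_1 = spark.read.csv('logs/logs_1.tsv', sep='\t').select('_c2')
urls_2 = spark.read.csv('logs/logs_2.tsv', sep='\t').select('_c2')

all_urls = urls_1.intersect(urls_2).filter("_c2 != 'localhost'")

# one partition, so the output directory contains a single part file
all_urls.coalesce(1).write.csv('logs_out')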

abiratsis

You can store it in Parquet format, which is well suited to HDFS. Note that write belongs to the DataFrame API, so the RDD first has to be converted to a DataFrame (assuming an active SparkSession), for example:

# convert the RDD of URLs into a single-column DataFrame, then write Parquet
all_urls.map(lambda url: (url,)).toDF(["url"]).write.parquet("dir_name")
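
Note that write.parquet also produces a directory of part files, so if a single output file is required, coalescing to one partition before writing applies here as well.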
Shrey