
I am writing a Spark application (single client) that deals with lots of small files, and I want to run the same algorithm on each of them. The files cannot be loaded into the same RDD for the algorithm to work, because it has to sort data within the boundaries of a single file.
Today I process one file at a time, which results in poor resource utilization (a small amount of data per action and a lot of overhead).
Is there any way to perform the same action/transformation on multiple RDDs simultaneously, using only one driver program? Or should I look for another platform, since this mode of operation isn't a classic fit for Spark?

Rtik88
  • Something like this: http://stackoverflow.com/a/31916657/1560062? – zero323 Oct 01 '15 at 13:36
  • So basically what the post suggests is to use multiple SparkContext objects (or SQLContext in the case of the post)? I am using Spark with Python, and if I try to configure multiple context objects I get: 'ValueError: Cannot run multiple SparkContexts at once'. – Rtik88 Oct 01 '15 at 14:55
  • No, only one context and asynchronous submission. As long as you have spare resources on the cluster these should be processed in parallel. – zero323 Oct 01 '15 at 14:57
  • First of all, thanks! Would you please elaborate on how to apply asynchronous submission of RDD actions using PySpark? I haven't found any API to do so. Thanks again! – Rtik88 Oct 01 '15 at 15:15
  • Nothing PySpark specific. `asyncio` for example. – zero323 Oct 01 '15 at 15:20 (one way to do this with a thread pool is sketched below)
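
A minimal sketch of the asynchronous-submission idea from the comments: keep a single `SparkContext` and submit one job per file from a small thread pool. The paths, pool size, and the per-file "algorithm" (`sortBy` + `collect`) below are placeholders, not anything prescribed by the discussion:

```python
# Sketch only: one SparkContext, jobs submitted concurrently from threads.
# Paths, pool size and the per-file "algorithm" below are placeholders.
from concurrent.futures import ThreadPoolExecutor

from pyspark import SparkContext

sc = SparkContext(appName="per-file-jobs")

def process_file(path):
    # Builds an RDD for one file and triggers an independent Spark job.
    # Replace sortBy/collect with the real per-file algorithm and output step.
    rdd = sc.textFile(path)
    return path, rdd.sortBy(lambda line: line).collect()

paths = ["hdfs:///data/file_%d.txt" % i for i in range(100)]  # hypothetical input

# With spare executors, the scheduler runs these jobs in parallel; setting
# spark.scheduler.mode=FAIR can keep one job from starving the others.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(process_file, paths))
```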

1 Answer


If you use SparkContext.wholeTextFiles, you can read all of the files into one RDD, where each record holds the content of a single file. You can then process each file separately, for example with RDD.mapPartitions(sort_file), where sort_file is the sorting function you want to apply to each file. This would use concurrency better than your current solution, as long as your files are small enough to be processed within a single partition.
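
A rough sketch of this idea in PySpark. The input path and the body of `sort_file` are placeholders; also, since `wholeTextFiles` yields one `(path, content)` pair per file, this variant processes each file per record via `mapValues` rather than per partition:

```python
# Sketch only: read many small files as (path, content) records and sort
# each file's lines independently. Paths and sort logic are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="sort-small-files")

def sort_file(content):
    # Stand-in for the real per-file algorithm: sort the lines of one file.
    return "\n".join(sorted(content.splitlines()))

files = sc.wholeTextFiles("hdfs:///data/small_files/")  # one record per file
sorted_files = files.mapValues(sort_file)

# For a small number of files you can collect; otherwise write the results
# out instead, e.g. with saveAsTextFile.
results = sorted_files.collectAsMap()
```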

mrm