I am really new to Spark, so my question might be too naive. I have a list of objects for which I need to separately run a number of Hive queries. Let's say I have the following (for simplicity, I omitted the config details of my SparkSession.builder):
class Car(object):
    def __init__(self, color, brand):
        self._color = color
        self._brand = brand

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # builder configs omitted

cars = [Car('col_' + str(i), 'brand_' + str(i)) for i in range(100)]  # list of objects to iterate on
results = []
for car in cars:
    # take the earliest row (by dt) matching this car's color, then its brand
    query1 = "select * from carcolors where car_color = '{}' order by dt limit 1".format(car._color)
    first_col = spark.sql(query1).first()
    query2 = "select * from carbrands where car_brand = '{}' order by dt limit 1".format(car._brand)
    first_brand = spark.sql(query2).first()
    results.append([first_col, first_brand])
The for loop seems like a really bad idea to me, because there is no parallelisation whatsoever (apart from the parallelism inside each individual query). I saw this suggestion: How to run independent transformations in parallel using PySpark? but it doesn't seem to correspond to my case, because I do not know the length of my list in advance. Any suggestions on how to do this more efficiently?
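For what it's worth, my understanding of the linked suggestion is a driver-side thread pool along the lines of the sketch below. This is only a minimal sketch of what I have in mind, not something I know to be correct: the helper name lookup and max_workers=8 are placeholders of my own, and I am assuming it is safe to call spark.sql from several Python threads against the same SparkSession (the threads only submit jobs; Spark schedules them).

from concurrent.futures import ThreadPoolExecutor

def lookup(car):
    # same two queries as in my loop above, for a single car
    first_col = spark.sql(
        "select * from carcolors where car_color = '{}' order by dt limit 1"
        .format(car._color)).first()
    first_brand = spark.sql(
        "select * from carbrands where car_brand = '{}' order by dt limit 1"
        .format(car._brand)).first()
    return [first_col, first_brand]

# the pool size is an arbitrary guess; executor.map preserves input order
# and accepts an iterable of any (unknown) length
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(lookup, cars))

Since executor.map works for an iterable of any length, not knowing the size of cars up front would not matter here; I just don't know whether this is the idiomatic Spark way to do it.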