I am really new to Spark, so my question might be too naive. I have a list of objects for which I need to separately run a number of Hive queries. Let's say I have the following (for simplicity, I omitted the config details of my SparkSession.builder):
class Car(object):
    def __init__(self, color, brand):
        self._color = color
        self._brand = brand

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # builder configs omitted

cars = [Car('col_' + str(i), 'brand_' + str(i)) for i in range(100)]  # list of objects to iterate on
results = []
for car in cars:
    # take the earliest row (by dt) matching this car's color, then its brand
    query1 = "select * from carcolors where car_color = '{}' order by dt limit 1".format(car._color)
    first_col = spark.sql(query1).first()
    query2 = "select * from carbrands where car_brand = '{}' order by dt limit 1".format(car._brand)
    first_brand = spark.sql(query2).first()
    results.append([first_col, first_brand])
The for loop seems like a really bad idea to me, because there is no parallelisation whatsoever (apart from the parallelism inside each individual query). I saw this suggestion: How to run independent transformations in parallel using PySpark? but it doesn't seem to correspond to my case, because I do not know the length of my list in advance. Any suggestions on how to do this more efficiently?
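For what it's worth, my understanding of the linked suggestion is a driver-side thread pool along the lines of the sketch below. This is only a minimal sketch of what I have in mind, not something I know to be correct: the helper name lookup and max_workers=8 are placeholders of my own, and I am assuming it is safe to call spark.sql from several Python threads against the same SparkSession (the threads only submit jobs; Spark schedules them).

from concurrent.futures import ThreadPoolExecutor

def lookup(car):
    # same two queries as in my loop above, for a single car
    first_col = spark.sql(
        "select * from carcolors where car_color = '{}' order by dt limit 1"
        .format(car._color)).first()
    first_brand = spark.sql(
        "select * from carbrands where car_brand = '{}' order by dt limit 1"
        .format(car._brand)).first()
    return [first_col, first_brand]

# the pool size is an arbitrary guess; executor.map preserves input order
# and accepts an iterable of any (unknown) length
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(lookup, cars))

Since executor.map works for an iterable of any length, not knowing the size of cars up front would not matter here; I just don't know whether this is the idiomatic Spark way to do it.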