I am really new to Spark, so my question might be too naive. I have a list of objects, and for each of them I need to run a number of Hive queries separately. Let's say I have the following (for simplicity, I omitted the config details of my SparkSession.builder):

class Car(object):
    def __init__(self, color, brand):
        self._color = color
        self._brand = brand

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

cars = [Car('col_'+str(i) , 'brand_'+str(i)) for i in range(100)]  #list of objects to iterate on
results = []
for car in cars:
    # Quote the value so it is read as a string literal, not a column name
    query1 = "select * from carcolors where car_color = '{}' order by dt limit 1".format(car._color)
    first_col = spark.sql(query1).first()
    query2 = "select * from carbrands where car_brand = '{}' order by dt limit 1".format(car._brand)
    first_brand = spark.sql(query2).first()
    results.append([first_col , first_brand])

The for loop seems like a really bad idea to me because there is no parallelisation whatsoever (that is, besides within each query). I saw this suggestion: How to run independent transformations in parallel using PySpark? but it doesn't seem to correspond to my case because I do not know the length of my list. Any suggestions on how to do this more efficiently?

Matina G
  • Would it be possible for you to provide a small example of your data and the desired output? – pault Mar 20 '19 at 17:25
  • The example I gave (iteration over objects and queries regarding attribute values of each object) is quite representative actually: there is a for loop with no dependence between iterations and some Hive queries in each iteration. – Matina G Mar 21 '19 at 09:52
  • [How to make good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – pault Mar 21 '19 at 12:56

1 Answer

You could first just use the len function on your list to get that length.
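
As a rough sketch of how that could fit together with the thread-based approach from the linked question: the pool size is derived from len(cars), so the list length does not need to be known up front. The names build_queries, run_all, run_query and the cap of 8 workers are hypothetical, picked only for illustration, and a stand-in function replaces Spark so the snippet runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor


class Car(object):
    def __init__(self, color, brand):
        self._color = color
        self._brand = brand


def build_queries(car):
    # Same two queries as in the question, with the values quoted as string literals.
    q_color = "select * from carcolors where car_color = '{}' order by dt limit 1".format(car._color)
    q_brand = "select * from carbrands where car_brand = '{}' order by dt limit 1".format(car._brand)
    return q_color, q_brand


def run_all(cars, run_query, max_workers=8):
    # len() works on any list, so the pool is sized at run time.
    # max_workers=8 is an arbitrary cap (assumption), not a recommended value.
    n_workers = max(1, min(max_workers, len(cars)))

    def per_car(car):
        q_color, q_brand = build_queries(car)
        return [run_query(q_color), run_query(q_brand)]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input list.
        return list(pool.map(per_car, cars))


def fake_run_query(query):
    # Stand-in for Spark (assumption): simply echoes the query text back.
    return query


cars = [Car('col_' + str(i), 'brand_' + str(i)) for i in range(100)]
results = run_all(cars, fake_run_query)
```

With a real session, run_query could be something like lambda q: spark.sql(q).first(). Spark accepts jobs submitted from separate threads of the same application, but how much this helps depends on the cluster resources and scheduler settings.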

Bhargav Rao
Noppu