I noticed that when I run the following code, which contains only one action, three jobs are launched.
from typing import List, Tuple

from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, StringType

data: List[Tuple[str, str, str, str]] = [
    ("Diamant_1A", "TopDiamant", "300", "rouge"),
    ("Diamant_2B", "Diamants pour toujours", "45", "jaune"),
    ("Diamant_3C", "Mes diamants préférés", "78", "rouge"),
    ("Diamant_4D", "Diamants que j'aime", "90", "jaune"),
    ("Diamant_5E", "TopDiamant", "89", "bleu"),
]

schema: StructType = StructType([
    StructField("reference", StringType(), True),
    StructField("marque", StringType(), True),
    StructField("prix", StringType(), True),
    StructField("couleur", StringType(), True),
])

# `spark` is the SparkSession provided by the Databricks runtime.
dataframe: DataFrame = spark.createDataFrame(data=data, schema=schema)
dataframe_filtree: DataFrame = dataframe.filter("prix > 50")
dataframe_filtree.show()
From my understanding, I should get only one job, since one action corresponds to one job. I'm using Databricks, which could be part of the problem. I have two questions:
- Why do I have 3 jobs instead of 1?
- Can I change this behaviour?
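For reference, here is a minimal sketch of how the job count can be checked programmatically instead of in the Spark UI. It assumes the Databricks-provided `spark` session from the snippet above; the job-group name "show_filtered" is an arbitrary label chosen for illustration.

sc = spark.sparkContext
# Tag everything triggered from this point with a job-group id, so the
# status tracker can report how many jobs the action actually spawned.
sc.setJobGroup("show_filtered", "jobs triggered by show()")
dataframe_filtree.show()
job_ids = sc.statusTracker().getJobIdsForGroup("show_filtered")
print(f"jobs launched by show(): {len(job_ids)}")

In my case this prints 3, which matches what the Spark UI shows.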