
Consider the following PySpark code:

def transformed_data(spark):
    df = spark.read.json('data.json')
    df = expensive_transformation(df)  # (A)    
    return df


df1 = transformed_data(spark)
df = transformed_data(spark)

df1 = foo_transform(df1)
df = bar_transform(df)

final_view = df.join(df1)

My question is: are the operations marked (A) inside transformed_data optimized when building final_view, so that they are only performed once?

Note that this code is not equivalent to

df1 = transformed_data(spark)
df = df1

df1 = foo_transform(df1)
df = bar_transform(df)

df.join(df1)

(at least from Python's point of view, where id(df1) == id(df) in this case).

The broader question is: what does Spark consider when optimizing two equal DAGs: whether the DAGs (as defined by their edges and nodes) are structurally equal, or whether their object ids are equal (df is df1)?
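
For concreteness, a minimal sketch of the distinction I mean (using the placeholder functions above; explain() only plans the query, it does not execute it):

df1 = transformed_data(spark)
df = transformed_data(spark)

# Distinct Python objects...
assert id(df1) != id(df)

df1.explain(extended=True)
df.explain(extended=True)   # ...but the printed plans have the same structure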

Jorge Leitao

1 Answer


Kinda. It relies on Spark having enough information to infer a dependence.

For instance, I replicated your example as described:

from pyspark.sql.functions import hash

def f(spark, filename):
    df = spark.read.csv(filename)
    df2 = df.select(hash('_c1').alias('hashc2'))
    df3 = df2.select(hash('hashc2').alias('hashc3'))
    df4 = df3.select(hash('hashc3').alias('hashc4'))
    return df4

filename = 'some-valid-file.csv'
df_a = f(spark, filename)
df_b = f(spark, filename)
assert df_a != df_b

df_joined = df_a.join(df_b, df_a.hashc4 == df_b.hashc4, how='left')

If I explain the resulting DataFrame using df_joined.explain(extended=True), I see the following four plans:

== Parsed Logical Plan ==
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hashc3#18, 42) AS hashc4#20]
:  +- Project [hash(hashc2#16, 42) AS hashc3#18]
:     +- Project [hash(_c1#11, 42) AS hashc2#16]
:        +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hashc3#40, 42) AS hashc4#42]
   +- Project [hash(hashc2#38, 42) AS hashc3#40]
      +- Project [hash(_c1#33, 42) AS hashc2#38]
         +- Relation[_c0#32,_c1#33,_c2#34] csv
== Analyzed Logical Plan ==
hashc4: int, hashc4: int
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hashc3#18, 42) AS hashc4#20]
:  +- Project [hash(hashc2#16, 42) AS hashc3#18]
:     +- Project [hash(_c1#11, 42) AS hashc2#16]
:        +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hashc3#40, 42) AS hashc4#42]
   +- Project [hash(hashc2#38, 42) AS hashc3#40]
      +- Project [hash(_c1#33, 42) AS hashc2#38]
         +- Relation[_c0#32,_c1#33,_c2#34] csv
== Optimized Logical Plan ==
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hash(hash(_c1#11, 42), 42), 42) AS hashc4#20]
:  +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hash(hash(_c1#33, 42), 42), 42) AS hashc4#42]
   +- Relation[_c0#32,_c1#33,_c2#34] csv
== Physical Plan ==
SortMergeJoin [hashc4#20], [hashc4#42], LeftOuter
:- *(2) Sort [hashc4#20 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(hashc4#20, 200)
:     +- *(1) Project [hash(hash(hash(_c1#11, 42), 42), 42) AS hashc4#20]
:        +- *(1) FileScan csv [_c1#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file: some-valid-file.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c1:string>
+- *(4) Sort [hashc4#42 ASC NULLS FIRST], false, 0
   +- ReusedExchange [hashc4#42], Exchange hashpartitioning(hashc4#20, 200)

The physical plan above reads the CSV only once and reuses all of that computation (note the ReusedExchange), since Spark detects that the two FileScan nodes are identical (i.e. Spark knows that they are not independent).
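
As a side note, if you want that guarantee regardless of what the optimizer can infer, you can cache the shared DataFrame yourself. A minimal sketch, reusing f and filename from above (alias plus qualified col is the usual way to keep a self-join unambiguous):

from pyspark.sql.functions import col

# Sketch: materialise the shared lineage once and branch from the same object.
base = f(spark, filename).cache()   # or .persist()

a = base.alias('a')                 # aliases keep the self-join columns unambiguous
b = base.alias('b')
df_joined = a.join(b, col('a.hashc4') == col('b.hashc4'), how='left')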

Now consider what happens if I replace the read.csv with hand-crafted, independent yet identical in-memory DataFrames (which show up in the plan as LogicalRDD nodes).

from pyspark.sql.functions import hash

def g(spark):
    df = spark.createDataFrame([('a', 'a'), ('b', 'b'), ('c', 'c')], ["_c1", "_c2"])
    df2 = df.select(hash('_c1').alias('hashc2'))
    df3 = df2.select(hash('hashc2').alias('hashc3'))
    df4 = df3.select(hash('hashc3').alias('hashc4'))
    return df4

df_c = g(spark)
df_d = g(spark)
df_joined = df_c.join(df_d, df_c.hashc4 == df_d.hashc4, how='left')

In this case, Spark's physical plan scans two different RDDs. Here's the output of running df_joined.explain(extended=True) to confirm.

== Parsed Logical Plan ==
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hashc3#6, 42) AS hashc4#8]
:  +- Project [hash(hashc2#4, 42) AS hashc3#6]
:     +- Project [hash(_c1#0, 42) AS hashc2#4]
:        +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hashc3#16, 42) AS hashc4#18]
   +- Project [hash(hashc2#14, 42) AS hashc3#16]
      +- Project [hash(_c1#10, 42) AS hashc2#14]
         +- LogicalRDD [_c1#10, _c2#11], false

== Analyzed Logical Plan ==
hashc4: int, hashc4: int
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hashc3#6, 42) AS hashc4#8]
:  +- Project [hash(hashc2#4, 42) AS hashc3#6]
:     +- Project [hash(_c1#0, 42) AS hashc2#4]
:        +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hashc3#16, 42) AS hashc4#18]
   +- Project [hash(hashc2#14, 42) AS hashc3#16]
      +- Project [hash(_c1#10, 42) AS hashc2#14]
         +- LogicalRDD [_c1#10, _c2#11], false

== Optimized Logical Plan ==
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hash(hash(_c1#0, 42), 42), 42) AS hashc4#8]
:  +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hash(hash(_c1#10, 42), 42), 42) AS hashc4#18]
   +- LogicalRDD [_c1#10, _c2#11], false

== Physical Plan ==
SortMergeJoin [hashc4#8], [hashc4#18], LeftOuter
:- *(2) Sort [hashc4#8 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(hashc4#8, 200)
:     +- *(1) Project [hash(hash(hash(_c1#0, 42), 42), 42) AS hashc4#8]
:        +- Scan ExistingRDD[_c1#0,_c2#1]
+- *(4) Sort [hashc4#18 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(hashc4#18, 200)
      +- *(3) Project [hash(hash(hash(_c1#10, 42), 42), 42) AS hashc4#18]
         +- Scan ExistingRDD[_c1#10,_c2#11]

This isn't really PySpark-specific behaviour; the plan comparison and reuse happen on the JVM side of Spark SQL, so the same applies to the Scala and Java APIs.
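
So if, as in this second case, Spark cannot infer the dependence on its own, the workaround is to give it one: call g(spark) once and branch both sides of the join off that single DataFrame (which is the df = df1 variant from your question), optionally caching it. A sketch, reusing g from above:

from pyspark.sql.functions import col

df_c = g(spark)              # build the expensive DataFrame once
# df_c = df_c.cache()        # optionally persist it as well

lhs = df_c.alias('lhs')
rhs = df_c.alias('rhs')      # both sides now derive from the same underlying RDD
df_joined = lhs.join(rhs, col('lhs.hashc4') == col('rhs.hashc4'), how='left')
df_joined.explain(extended=True)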

Jedi
  • So, if I understand your answer: when the source is read from X, it does optimize. When the source is created on the fly, it does not. In other words, it uses "whether the DAGs (as defined by their edges and nodes) are equal" – Jorge Leitao Apr 09 '19 at 07:54
  • Right. The Spark DAG has edges (operations) and nodes (RDDs). If two DAGs are not derived from a common node (RDD), then they cannot be optimised even if they are "identical". In the first case, the source RDD was derived using the same `FileScan` and the duplicate ops were optimised away. In the second case, there were two RDDs that Spark did not know could be treated as one. – Jedi Apr 09 '19 at 12:48