
When I read a table with PySpark

df1 = session.read.jdbc(url=self.url,
                        table=self.table,
                        properties={"driver": self.driver,
                                    "user": self.user,
                                    "password": self.password})

or

df1 = (sql.read.format("com.databricks.spark.redshift")
          .options(url=self.url,
                   query=self.table,
                   user=self.user,
                   password=self.password,
                   tempdir="s3://path/data/")
          .load())

and then I apply some transformations like joins and groupBy

df3 = df1.join(df2, df1.id == df2.id)
df_res = df3.groupBy("id").agg({'tax': 'sum'})

is that done against the DB? If yes, how can I do it in memory instead?

1 Answer

It is not. While Spark can push down simple projections and selections (the details depend on the particular data source implementation), it doesn't run heavy processing against the database unless it is explicitly instructed to, and only in data sources that support it.

For example, with jdbc you can pass a subquery as the table argument, and with com.databricks.spark.redshift you can use the query option.
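
A minimal sketch of both approaches, reusing the self.* fields from the question; the table names, the join, and the SUM(tax) aggregate are placeholders for whatever SQL you want the database to execute:

# jdbc: wrap the heavy work in a parenthesized subquery with an alias and pass it
# as the "table" argument, so the database runs the join/aggregation and Spark
# only reads the (much smaller) result
subquery = """(SELECT t1.id, SUM(t1.tax) AS tax_sum
               FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id
               GROUP BY t1.id) AS tmp"""

df_res = session.read.jdbc(url=self.url,
                           table=subquery,
                           properties={"driver": self.driver,
                                       "user": self.user,
                                       "password": self.password})

# spark-redshift: pass the same SQL through the "query" option instead of a table name
df_res = (sql.read.format("com.databricks.spark.redshift")
             .options(url=self.url,
                      query="SELECT t1.id, SUM(t1.tax) AS tax_sum "
                            "FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id "
                            "GROUP BY t1.id",
                      user=self.user,
                      password=self.password,
                      tempdir="s3://path/data/")
             .load())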

With the first snippet (jdbc) Spark will read all the required data (after applying possible projections and selections) using a single executor; with the second it will unload the result of the initial query (plus projections and selections) to S3 and read it from there in parallel. In both cases Spark should be able to push down the projection (id, tax).
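
If the single-executor jdbc read becomes a bottleneck, the reader can be told to partition the scan; a sketch, assuming id is a numeric column and its rough bounds are known in advance:

# Partitioned jdbc read: Spark issues one bounded query per partition,
# so several executors read from the database in parallel
df1 = session.read.jdbc(url=self.url,
                        table=self.table,
                        column="id",          # numeric partition column (assumed)
                        lowerBound=1,         # assumed minimum id
                        upperBound=1000000,   # assumed maximum id
                        numPartitions=10,
                        properties={"driver": self.driver,
                                    "user": self.user,
                                    "password": self.password})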

After that, Spark will process the data within the cluster; this includes the join and the aggregation.
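
You can verify what is actually pushed to the source versus executed by Spark by inspecting the physical plan (the exact plan text varies by Spark version and data source):

# The scan node of the plan shows the pushed projection/filters,
# while the join and the aggregation appear as Spark operators
df_res.explain()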
