
Suppose I have something like

df1 = sqlContext.sql("select count(1) as ct1 from big_table_1")
df2 = sqlContext.sql("select count(1) as ct2 from big_table_2")
df1.show()
df2.show()

Within each table (whether a Hive table or a temporary view), the rows will be counted in parallel across the worker nodes, assuming the underlying DataFrame has multiple partitions.

Is there also a way to get the two tables counted in parallel, one job per table? Is this even possible in PySpark?
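One approach I have seen suggested is to trigger each action from its own driver thread, since Spark's scheduler accepts concurrent jobs submitted from separate threads within a single application. Below is a minimal sketch of that idea (the ThreadPoolExecutor wrapper and the run_count helper are my own assumption, reusing sqlContext and the table names from above), not something I have verified:

from concurrent.futures import ThreadPoolExecutor

def run_count(query):
    # collect() is the action that actually launches the Spark job
    return sqlContext.sql(query).collect()[0][0]

queries = [
    "select count(1) as ct1 from big_table_1",
    "select count(1) as ct2 from big_table_2",
]

# Each thread submits its own job; the driver can schedule both
# concurrently as long as the cluster has free executor slots.
with ThreadPoolExecutor(max_workers=2) as pool:
    ct1, ct2 = pool.map(run_count, queries)

print(ct1, ct2)

Is something along these lines the right direction, or is there a more idiomatic way to do it in PySpark?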

wrschneider
