Suppose I have something like
df1 = sqlContext.sql("select count(1) as ct1 from big_table_1")
df2 = sqlContext.sql("select count(1) as ct2 from big_table_2")
df1.show()
df2.show()
Within each table (whether a Hive table or a temporary one), the rows are counted in parallel across the worker nodes, assuming the underlying DataFrame is partitioned. However, as far as I can tell the two show() calls themselves run one after the other, because each action blocks the driver until its job finishes.
Is there also a way to get the two counts to run in parallel? Is this even possible in PySpark?
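One idea I had was to submit the two actions from separate driver threads, on the theory that Spark can schedule jobs from different threads concurrently. Something like the rough sketch below (the ThreadPoolExecutor approach is just my guess, and run_count is a helper I made up, not a real API):

from concurrent.futures import ThreadPoolExecutor

# Hypothetical helper: each count(1) query becomes its own Spark job
# when the action (collect) runs. Submitting the actions from separate
# driver threads should let the scheduler run the jobs concurrently,
# given enough executor capacity.
def run_count(query):
    return sqlContext.sql(query).collect()

queries = [
    "select count(1) as ct1 from big_table_1",
    "select count(1) as ct2 from big_table_2",
]

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_count, queries))

for rows in results:
    print(rows)

But I don't know whether this is the idiomatic way to do it, or whether there is something built into Spark for running independent queries concurrently.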