
I want to create a small DataFrame with just 10 rows and force it to be distributed across two worker nodes. My cluster has only two worker nodes. How do I do that?

Currently, whenever I create such a small DataFrame, it gets persisted on only one worker node.

I know Spark is built for Big Data and this question may not make much sense. However, conceptually, I just want to know whether it is feasible at all to force a Spark DataFrame to be split across all the worker nodes (given a very small DataFrame of only 10-50 rows).

Or is it completely impossible, so that we have to rely on the Spark master for this DataFrame distribution?
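For context, here is a minimal sketch of the kind of thing I am trying (assuming a spark-shell session with the SparkSession in scope as `spark`, on the two-worker cluster):

```scala
// Minimal sketch, assuming a SparkSession named `spark` is in scope
// (e.g. in spark-shell) on a cluster with exactly two worker nodes.
import spark.implicits._

// A tiny 10-row DataFrame.
val df = (1 to 10).toDF("id")

// Explicitly ask for two partitions, hoping each lands on a different worker.
val df2 = df.repartition(2)
df2.persist()
df2.count() // materialize the cached partitions

// Reports 2 partitions, but both may still be cached on the same worker.
println(df2.rdd.getNumPartitions)
```

Even with an explicit `repartition(2)`, the scheduler is free to run both tasks on the same executor, so both cached partitions can still end up on one node.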

  • See this https://stackoverflow.com/questions/48553005/spark-dataframe-not-distributed – Shahab Niaz Feb 06 '19 at 09:37
  • Did you find a way? – Romain Jouin May 01 '19 at 10:20
  • Nope. Apparently we don't have any control here. It is the Spark master's task scheduler that has this control, and it decides that splitting such a small DataFrame across different machines does not make sense; that would be overkill and extra processing. (See the sketch after these comments for one way to check the actual placement.) – user3243499 May 01 '19 at 10:33
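As a way to verify the comment above, a sketch like the following (reusing the hypothetical `df2` from the question's sketch) prints the hostname that processes each partition; with data this small, both partitions often report the same worker:

```scala
import java.net.InetAddress

// Report which host processes each partition. This shows task placement,
// not necessarily where the cached blocks live; the Executors tab of the
// Spark UI shows cached block locations directly.
val placements = df2.rdd
  .mapPartitionsWithIndex { (idx, rows) =>
    Iterator((idx, InetAddress.getLocalHost.getHostName, rows.size))
  }
  .collect()

placements.foreach { case (idx, host, n) =>
  println(s"partition $idx ran on $host with $n rows")
}
```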

0 Answers