
I want to create a small DataFrame with just 10 rows and force it to be distributed across two worker nodes. My cluster has only two worker nodes. How do I do that?

Currently, whenever I create such a small DataFrame, it gets persisted on only one worker node.

I know Spark is built for Big Data and this question may not make much sense. However, conceptually, I just want to know whether it is feasible at all to force a Spark DataFrame to be split across all the worker nodes (given a very small DataFrame of only 10-50 rows).

Or is it completely impossible, so that we have to rely on the Spark master for this DataFrame distribution?
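For context, here is a minimal sketch of the kind of thing I am trying (assuming a spark-shell session with the SparkSession in scope as `spark`, on the two-worker cluster):

```scala
// Minimal sketch, assuming a SparkSession named `spark` is in scope
// (e.g. in spark-shell) on a cluster with exactly two worker nodes.
import spark.implicits._

// A tiny 10-row DataFrame.
val df = (1 to 10).toDF("id")

// Explicitly ask for two partitions, hoping each lands on a different worker.
val df2 = df.repartition(2)
df2.persist()
df2.count() // materialize the cached partitions

// Reports 2 partitions, but both may still be cached on the same worker.
println(df2.rdd.getNumPartitions)
```

Even with an explicit `repartition(2)`, the scheduler is free to run both tasks on the same executor, so both cached partitions can still end up on one node.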

  • See this https://stackoverflow.com/questions/48553005/spark-dataframe-not-distributed – Shahab Niaz Feb 06 '19 at 09:37
  • Did you find a way? – Romain Jouin May 01 '19 at 10:20
  • Nope. Apparently we don't have any control here. It is the Spark master's task scheduler that has this control, and it decides that splitting such a small DataFrame across different machines does not make sense; that would be overkill and extra processing. (See the sketch after these comments for one way to check the actual placement.) – user3243499 May 01 '19 at 10:33
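As a way to verify the comment above, a sketch like the following (reusing the hypothetical `df2` from the question's sketch) prints the hostname that processes each partition; with data this small, both partitions often report the same worker:

```scala
import java.net.InetAddress

// Report which host processes each partition. This shows task placement,
// not necessarily where the cached blocks live; the Executors tab of the
// Spark UI shows cached block locations directly.
val placements = df2.rdd
  .mapPartitionsWithIndex { (idx, rows) =>
    Iterator((idx, InetAddress.getLocalHost.getHostName, rows.size))
  }
  .collect()

placements.foreach { case (idx, host, n) =>
  println(s"partition $idx ran on $host with $n rows")
}
```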

0 Answers