
Imagine you are loading a large dataset into Spark via the SparkContext and Hive, so the dataset is distributed across your Spark cluster. For instance: observations (values + timestamps) for thousands of variables.

Now you would use some map/reduce methods or aggregations to organize and analyze your data, for instance grouping by variable name.

Once grouped, you could get all observations (values) for each variable as a time-series DataFrame. If you now use DataFrame.toPandas:

def myFunction(data_frame):
    return data_frame.toPandas()

df = sc.load....  # pseudocode: load the dataset as described above
df.groupBy('var_name').mapValues(_.toDF).map(myFunction)
  1. is this converted to a Pandas DataFrame (per variable) on each worker node, or
  2. are Pandas DataFrames always created on the driver node, so that the data is transferred from the worker nodes to the driver?
Matthias

1 Answer


There is nothing special about a Pandas DataFrame in this context.

  • If the DataFrame is created by calling the toPandas method on a pyspark.sql.dataframe.DataFrame, this collects the data and creates a local Python object on the driver.
  • If a pandas.core.frame.DataFrame is created inside an executor process (for example in mapPartitions), you simply get an RDD[pandas.core.frame.DataFrame]. There is no distinction between Pandas objects and, let's say, a tuple (see the sketch after this list).
  • Finally, the pseudocode in your example couldn't work, because you cannot create (in a sensible way) a Spark DataFrame (I assume this is what you mean by _.toDF) inside an executor thread.
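
To make the distinction concrete, here is a minimal sketch of both cases. It is not from the original post: it assumes an existing SparkSession named `spark`, and the schema is purely illustrative.

import pandas as pd

# Illustrative data: (variable name, value, timestamp)
sdf = spark.createDataFrame(
    [("x", 1.0, 1), ("x", 2.0, 2), ("y", 3.0, 1)],
    ["var_name", "value", "ts"],
)

# Case 1: toPandas() collects all rows to the driver and builds a
# single local pandas.DataFrame there.
local_pdf = sdf.toPandas()

# Case 2: pandas DataFrames built inside mapPartitions stay on the
# executors; the result is simply an RDD of pandas.DataFrame objects.
def partition_to_pandas(rows):
    yield pd.DataFrame([row.asDict() for row in rows])

rdd_of_pdfs = sdf.rdd.mapPartitions(partition_to_pandas)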
zero323
  • So you could use the Pandas DataFrame API, for instance within a map function, in order to use its more convenient methods even on the worker node, for instance doing some analytics on just that variable's data in the map step and returning the result. – Matthias Aug 25 '16 at 21:43
  • Yes, it is possible, in a similar way as SparkR `dapply`. Getting the desired performance can be tricky though, so you'll have to balance resource allocation and parallelism. – zero323 Aug 25 '16 at 21:50
  • You're my hero. I'm still a newbie but improving. Maybe you can help me out with [that one, too](http://stackoverflow.com/questions/39155954/how-to-do-a-nested-for-each-loop-with-pyspark). – Matthias Aug 25 '16 at 22:55
  • Depends on how you create it, see my answer here https://stackoverflow.com/a/47716346/843463 – chhantyal Dec 08 '17 at 14:30
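
A hedged sketch of the pattern discussed in the comments above: pandas does its per-variable work on the executors, and only the small results travel back to the driver. It reuses the illustrative `sdf` from the earlier sketch; the `summarize` helper and column names are assumptions, not part of the original thread.

import pandas as pd

def summarize(rows):
    # rows: all observations for one var_name; build a local pandas
    # DataFrame on the executor and reduce it to a single number.
    pdf = pd.DataFrame([row.asDict() for row in rows])
    return float(pdf["value"].mean())

means = (
    sdf.rdd
       .keyBy(lambda row: row["var_name"])  # key each Row by its variable
       .groupByKey()                        # gather observations per variable
       .mapValues(summarize)                # pandas runs on the executors
       .collect()                           # only (name, mean) pairs return
)

Note that groupByKey shuffles all observations for a variable to one executor, so very skewed variables can strain a single worker; that is the resource/parallelism balancing the comment alludes to.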