Imagine you load a large dataset via SparkContext and Hive, so that the dataset is distributed across your Spark cluster: for instance, observations (values + timestamps) for thousands of variables.
You would then use map/reduce operations or aggregations to organize and analyze the data, for instance grouping by variable name.
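To make the intended result concrete, here is what "one time-series per variable" means, sketched with plain pandas on a single machine rather than on a cluster (the column names `var_name`, `timestamp`, `value` and the sample rows are illustrative, not from the real dataset):

```python
import pandas as pd

# Illustrative rows; in the real job these would come from Hive.
rows = [
    ("temp", 1, 20.5),
    ("temp", 2, 21.0),
    ("pressure", 1, 101.3),
]
df = pd.DataFrame(rows, columns=["var_name", "timestamp", "value"])

# One (timestamp, value) time-series DataFrame per variable name.
per_var = {
    name: group[["timestamp", "value"]]
    for name, group in df.groupby("var_name")
}
```

The question below is about where this kind of grouping-plus-conversion happens when it is done with Spark instead.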
Once grouped, you could get all observations (values) for each variable as a time-series DataFrame. If you now use DataFrame.toPandas:
def myFunction(data_frame):
    pandas_df = data_frame.toPandas()
    # ... work on the Pandas DataFrame ...

df = sc.load....
# pseudocode: group the observations per variable, turn each group
# into a DataFrame, then apply myFunction to each group
df.groupBy('var_name').mapValues(_.toDF).map(myFunction)
- is the conversion to a Pandas DataFrame (per variable) done on each worker node, or
- do Pandas DataFrames always live on the driver node, so that the data is transferred from the worker nodes to the driver?
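For clarity, this is the behavior I am hoping for, simulated without Spark: each worker would receive the rows of one variable and build a local Pandas DataFrame from them (the function name `myFunction`, the grouped dict, and the sample values are all hypothetical placeholders for the distributed case):

```python
import pandas as pd

def myFunction(rows):
    # rows: iterable of (timestamp, value) pairs for ONE variable,
    # as a worker would see them after the groupBy
    return pd.DataFrame(rows, columns=["timestamp", "value"])

# Stand-in for the output of groupBy('var_name') on the cluster.
grouped = {
    "temp": [(1, 20.5), (2, 21.0)],
    "pressure": [(1, 101.3)],
}

# Desired outcome: one local Pandas DataFrame per variable,
# built where the group's data already resides.
frames = {name: myFunction(rows) for name, rows in grouped.items()}
```

If instead toPandas always materializes on the driver, this per-group, worker-local conversion would not be possible this way.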