
I am reading a CSV file with:

    data = sc.textFile("filename")
    df = sqlContext.createDataFrame(data.map(lambda line: line.split(",")))
    pdf = df.toPandas()

Now, is pdf distributed across the Spark cluster, or does it reside in memory on the host (driver) machine?

mahanthesh
  • It would reside locally on the driver machine. – None Aug 05 '15 at 18:26
  • @hadooped Does createDataFrame() make the DataFrame distributed? If not, how do I make the DataFrame distributed? – mahanthesh Aug 05 '15 at 18:31
  • It would be my understanding, after reading the documentation, that any Spark DataFrame is distributed across the cluster, but the moment you convert it to a pandas DataFrame it exists only on whatever machine/node your code was executed on. – Jared Aug 06 '15 at 14:32
  • 1
    Possible duplicate of [What is the Spark DataFrame method \`toPandas\` actually doing?](http://stackoverflow.com/questions/29226210/what-is-the-spark-dataframe-method-topandas-actually-doing) – Paul Apr 20 '16 at 02:04
  • Possible duplicate of [Requirements for converting Spark dataframe to Pandas/R dataframe](http://stackoverflow.com/questions/30983197/requirements-for-converting-spark-dataframe-to-pandas-r-dataframe) –  Oct 31 '16 at 15:34

1 Answer


No.

As the PySpark source code for `DataFrame.toPandas` notes:

    .. note:: This method should only be used if the resulting Pandas's DataFrame is expected
        to be small, as all the data is loaded into the driver's memory.
Luis A.G.