
I am reading a CSV file with:

    data = sc.textFile("filename")
    df = sqlContext.createDataFrame(data.map(lambda line: line.split(",")))
    pdf = df.toPandas()

Now, is pdf distributed across the Spark cluster, or does it reside in memory on the host (driver) machine?

mahanthesh
  • It would reside locally on the driver machine. – None Aug 05 '15 at 18:26
  • @hadooped Does createDataFrame() make the DataFrame distributed? If not, how do I make the DataFrame distributed? – mahanthesh Aug 05 '15 at 18:31
  • It would be my understanding, after reading the documentation, that any Spark DataFrame is distributed across the cluster, but the moment you convert it to a pandas DataFrame it exists only on whatever machine/node your code was executed on. – Jared Aug 06 '15 at 14:32
  • 1
    Possible duplicate of [What is the Spark DataFrame method \`toPandas\` actually doing?](http://stackoverflow.com/questions/29226210/what-is-the-spark-dataframe-method-topandas-actually-doing) – Paul Apr 20 '16 at 02:04
  • Possible duplicate of [Requirements for converting Spark dataframe to Pandas/R dataframe](http://stackoverflow.com/questions/30983197/requirements-for-converting-spark-dataframe-to-pandas-r-dataframe) –  Oct 31 '16 at 15:34

1 Answer


No.

As the PySpark source code for `DataFrame.toPandas` notes:

    .. note:: This method should only be used if the resulting Pandas's DataFrame is expected
        to be small, as all the data is loaded into the driver's memory.
Luis A.G.