
If I create a DataFrame or RDD in Spark and convert it to a pandas DataFrame, does it still work with Spark, or does it only work in local Python memory?

DK2

1 Answer

If you simply convert a Spark DataFrame or RDD to pandas, you get all the data on the driver (i.e. on a single machine).
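A minimal sketch of what that means in practice. The `spark_df.toPandas()` call shown in the comment is the real PySpark API; `pdf` below is a pandas stand-in for the collected result, since everything after the conversion runs in the driver's local Python process only:

```python
import pandas as pd

# With Spark:
#   pdf = spark_df.toPandas()   # collect() under the hood: every row is
#                               # shipped to the driver's memory
# From here on, nothing is distributed -- it's a plain pandas DataFrame:
pdf = pd.DataFrame({"x": range(4)})   # stand-in for the collected result
doubled = pdf["x"] * 2                # executed in the driver's Python process
```

If the Spark DataFrame is larger than the driver's memory, `toPandas()` will fail or thrash, which is why it is only appropriate for small results.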

Spark, starting with v2.4.0, includes the ability to create pandas user-defined functions (Pandas UDFs, see https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html), which let you use pandas in a distributed manner. (You could have done this before as well, but with more work to translate back and forth.) Note that each pandas instance in that case gets only part of the data.
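A sketch of the idea: the function itself is plain pandas and operates on one Arrow batch of the data at a time on a worker; only the registration step (shown in the comments) needs Spark. The column name `x` and the DataFrame `df` in the comments are illustrative:

```python
import pandas as pd

def add_one(batch: pd.Series) -> pd.Series:
    # Runs on each Arrow batch at a worker; `batch` holds only that
    # worker's slice of the column, not the whole dataset.
    return batch + 1

# With Spark >= 2.4 available, wrap it as a Pandas UDF and apply it:
#   from pyspark.sql.functions import pandas_udf
#   add_one_udf = pandas_udf(add_one, "long")
#   df.select(add_one_udf("x"))
```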

Arnon Rotem-Gal-Oz
  • Can you provide any supporting document for your first statement, `you'd get all the data in the master (i.e. on a single machine)`? – meW Feb 25 '19 at 11:46
  • 1
    The *support document* is in the code of toPandas @meW. It use `collect` to create the pandas dataframe. You ought checking the link provided under your question – eliasah Feb 25 '19 at 12:26
  • @eliasah Will do. Thanks. – meW Feb 25 '19 at 12:27
  • Thanks. One more question. Is there any way to know whether my pandas dataframe is distributed? If I use something like apply() on a distributed pandas dataframe, will it stay distributed, or will it be collected on the master? – DK2 Feb 26 '19 at 02:29
  • As long as you write something as a UDF, it will be evaluated at the workers (actually at a Python instance running alongside each Spark worker), and each instance will work on its own partial copy of the data. – Arnon Rotem-Gal-Oz Feb 26 '19 at 08:12
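To illustrate the last comment: with a grouped-map Pandas UDF (a `GROUPED_MAP` `pandas_udf` in Spark 2.4, `groupBy().applyInPandas()` in Spark 3.x), the function below receives one group's rows as an ordinary pandas DataFrame at a worker, so each Python instance sees only its partial copy of the data. The column names `k`/`v` and the schema string are illustrative:

```python
import pandas as pd

def center(pdf: pd.DataFrame) -> pd.DataFrame:
    # Called once per group at a worker; `pdf` is a plain pandas
    # DataFrame containing only that group's rows.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# With Spark 3.x:
#   df.groupBy("k").applyInPandas(center, schema="k long, v double")
```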