If I create a DataFrame or RDD in Spark and convert it to a pandas DataFrame, does it still work with Spark, or will it live in Python memory only?
1 Answer
If you simply convert a Spark DataFrame or RDD to pandas, you get all the data on the master (i.e. on a single machine).

Starting with v2.4.0, Spark includes the ability to create pandas user-defined functions (pandas UDFs, see https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html), which let you use pandas in a distributed manner (you could do this before as well, but with more work to translate back and forth). Note that in that case each pandas instance gets only part of the data.

Arnon Rotem-Gal-Oz
- Can you provide any supporting document for your first statement, `you'd get all the data in the master (i.e. on a single machine)`? – meW Feb 25 '19 at 11:46
- The *support document* is in the code of `toPandas`, @meW. It uses `collect` to create the pandas DataFrame. You ought to check the link provided under your question – eliasah Feb 25 '19 at 12:26
- @eliasah Will do. Thanks. – meW Feb 25 '19 at 12:27
- Thanks. One more question: is there any way to know whether my pandas DataFrame holds distributed data? If I operate on a distributed pandas DataFrame, e.g. with apply(), will it stay distributed, or will it be collected on the master? – DK2 Feb 26 '19 at 02:29
- As long as you write it as a UDF, it will be evaluated at the workers (actually at a Python instance running in parallel to the Spark worker), and each instance will work on its own partial copy of the data – Arnon Rotem-Gal-Oz Feb 26 '19 at 08:12