If I create a DataFrame or RDD in Spark and convert it to a pandas DataFrame, does it still work with Spark, or will it live in Python memory only?
1 Answer
If you simply convert a Spark DataFrame or RDD to pandas, you get all the data on the master (i.e. on a single machine).

Starting with v2.4.0, Spark includes the ability to create pandas user-defined functions (pandas UDFs, see https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html), which let you use pandas in a distributed manner (you could do this before as well, but with more work to translate back and forth). Note that in that case each pandas instance gets only part of the data.

Arnon Rotem-Gal-Oz
- Can you provide any supporting document for your first statement, `you'd get all the data in the master (i.e. on a single machine)`? – meW Feb 25 '19 at 11:46
- The *support document* is in the code of `toPandas`, @meW. It uses `collect` to create the pandas DataFrame. You ought to check the link provided under your question – eliasah Feb 25 '19 at 12:26
- @eliasah Will do. Thanks. – meW Feb 25 '19 at 12:27
- Thanks. One more question: is there any way to know whether my pandas DataFrame holds distributed data? If I operate on a distributed pandas DataFrame, e.g. with apply(), will it stay distributed, or will it be collected on the master? – DK2 Feb 26 '19 at 02:29
- As long as you write it as a UDF, it will be evaluated at the workers (actually at a Python instance running in parallel to the Spark worker), and each instance will work on its own partial copy of the data – Arnon Rotem-Gal-Oz Feb 26 '19 at 08:12