Spark >= 2.3.0
Since Spark 2.3.0 it is possible to use Pandas Series or DataFrames by partition or group, using Pandas UDFs (pandas_udf), for example a grouped map UDF applied with GroupedData.apply.
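For illustration, here is a minimal sketch of the grouped map Pandas UDF API from Spark 2.3. It assumes an active SparkSession named spark; the example DataFrame and the demean function are made up for this sketch, not taken from the question:

from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v")
)

# Each group is handed to the function as a single Pandas DataFrame,
# and the returned Pandas DataFrame is converted back into Spark rows.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def demean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(demean).show()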
Spark < 2.3.0
> what is the natural way to create either a local PySpark

There is no such thing. Spark distributed data structures cannot be nested, or, if you prefer another perspective, you cannot nest actions or transformations.

> or Pandas DataFrame
It is relatively easy, but you have to remember at least a few things:
- Pandas and Spark DataFrames are not even remotely equivalent. These are different structures with different properties, and in general you cannot replace one with the other.
- Partitions can be empty.
- It looks like you're passing dictionaries. Remember that a plain Python dictionary is unordered (unlike collections.OrderedDict, for example; plain dicts preserve insertion order only since Python 3.7), so the column order may not be what you expect.
import pandas as pd

rdd = sc.parallelize([
    {"x": 1, "y": -1},
    {"x": -3, "y": 0},
    {"x": -0, "y": 4}
])

def combine(iterator):
    # Materialize the partition and skip empty partitions entirely.
    rows = list(iterator)
    # If column order matters, pass it explicitly,
    # e.g. pd.DataFrame(rows, columns=["x", "y"]).
    return [pd.DataFrame(rows)] if rows else []

rdd.mapPartitions(combine).first()
## x y
## 0 1 -1
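If what you actually need is a single local Pandas DataFrame, one option (just a sketch reusing the combine function above) is to collect the per-partition frames and concatenate them on the driver:

# collect() pulls everything to the driver, so this is only reasonable
# for results that fit comfortably in local memory.
frames = rdd.mapPartitions(combine).collect()
pdf = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()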