
I have the following simple code, which causes an error related to caching:

trips_in = sc.textFile("trip_data.csv")
trips = trips_in.map(lambda l: l.split(",")).map(lambda x: parseTrip(x)).cache()

trips.count()

The function parseTrip() receives a list of strings and creates and returns an instance of the class Trip:

class Trip:
  def __init__(self, id, duration):
    self.id = id
    self.duration = duration
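
where parseTrip is essentially the following (simplified; the exact fields do not matter for the error):

def parseTrip(fields):
  # fields is the list produced by l.split(",")
  # which columns are picked is illustrative; the error is about pickling Trip
  return Trip(fields[0], float(fields[1]))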

I get the error right after the action count(). However, if I remove the cache() at the end of the second line, everything works fine. According to the error, the problem is that the class Trip cannot be pickled:

PicklingError: Can't pickle __main__.Trip: attribute lookup __main__.Trip failed

So how can I make it picklable (if it is an actual word)? Note that I am using a Databricks notebook, so I cannot create a separate .py file for the class definition to make it picklable.


1 Answer


The environment does not affect the answer: if you want to use custom classes, they have to be importable on every node in the cluster.

  • For a single module you can easily use SparkContext.addPyFile with a URL to a GitHub Gist (or another supported source: "file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI"); see the first sketch after this list.

    • Create a gist.
    • Click the Raw link and copy the URL.
    • In your notebook call:

      sc.addPyFile(raw_gist_url)
      
  • For complex dependencies you can distribute egg files; see the second sketch after this list.

    • Create a Python package using setuptools.

      Directory structure:

      .
      ├── setup.py
      └── trip
          └── __init__.py
      

      Example setup file:

      #!/usr/bin/env python
      
      from setuptools import setup
      
      setup(name='trip',
            version='0.0.1',
            description='Trip',
            author='Jane Doe',
            author_email='jane@example.com',
            url='https://example.com',
            packages=['trip'],)
      
    • Create egg file:

      python setup.py bdist_egg
      

      This will create a dist directory containing a trip-0.0.1-pyX.Y.egg file, where X.Y is your Python version.

    • Go to the Databricks dashboard -> New -> Library and upload the egg file from the dist directory.

    • Attach the library to the cluster you want to use.

  • Finally, if all you want is a record type, you can use a namedtuple without any additional steps (see the last sketch after this list):

    from collections import namedtuple
    
    Trip = namedtuple('Trip', ['id', 'duration'])
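
Here is a minimal sketch of the Gist approach, assuming the Gist consists of a single file named trip.py (the file name is a placeholder, and it determines the module name):

# trip.py -- the only file in the Gist
class Trip:
  def __init__(self, id, duration):
    self.id = id
    self.duration = duration

and in the notebook:

sc.addPyFile(raw_gist_url)  # the URL copied from the Raw button

from trip import Trip  # now importable on the driver and on every executor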
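
With the egg approach the class definition goes into trip/__init__.py, for example:

# trip/__init__.py
class Trip:
  def __init__(self, id, duration):
    self.id = id
    self.duration = duration

Once the library is attached to the cluster, the import works the same way:

from trip import Trip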
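
And a sketch of the namedtuple variant applied to the pipeline from the question (assuming id and duration are the first two fields of each line):

from collections import namedtuple

Trip = namedtuple('Trip', ['id', 'duration'])

trips_in = sc.textFile("trip_data.csv")
trips = trips_in.map(lambda l: l.split(",")).map(lambda x: Trip(x[0], float(x[1]))).cache()
trips.count()  # no PicklingError: PySpark handles namedtuple serialization out of the box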
    