
I have the following simple code, which causes an error related to caching:

trips_in = sc.textFile("trip_data.csv")
trips = trips_in.map(lambda l: l.split(",")).map(lambda x: parseTrip(x)).cache()

trips.count()

The function parseTrip() receives a list of strings and creates and returns an instance of the class Trip:

class Trip:
  def __init__(self, id, duration):
    self.id = id
    self.duration = duration
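
where parseTrip is essentially the following (simplified; the exact fields do not matter for the error):

def parseTrip(fields):
  # fields is the list produced by l.split(",")
  # which columns are picked is illustrative; the error is about pickling Trip
  return Trip(fields[0], float(fields[1]))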

I get the error right after the action count(). However, if I remove the cache() at the end of the second line, everything works fine. According to the error, the problem is that the class Trip cannot be pickled:

PicklingError: Can't pickle __main__.Trip: attribute lookup __main__.Trip failed

So how can I make it picklable (if it is an actual word)? Note that I am using a Databricks notebook, so I cannot create a separate .py file for the class definition to make it picklable.


1 Answer


The environment does not affect the answer: if you want to use custom classes, they have to be importable on every node in the cluster.

  • For a single module you can easily use SparkContext.addPyFile with a URL to a GitHub Gist (or another supported source: "file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI"); see the first sketch after this list.

    • Create a gist.
    • Click the Raw link and copy the URL.
    • In your notebook call:

      sc.addPyFile(raw_gist_url)
      
  • For complex dependencies you can distribute egg files; see the second sketch after this list.

    • Create a Python package using setuptools.

      Directory structure:

      .
      ├── setup.py
      └── trip
          └── __init__.py
      

      Example setup file:

      #!/usr/bin/env python
      
      from setuptools import setup
      
      setup(name='trip',
            version='0.0.1',
            description='Trip',
            author='Jane Doe',
            author_email='jane@example.com',
            url='https://example.com',
            packages=['trip'],)
      
    • Create egg file:

      python setup.py bdist_egg
      

      This will create a dist directory containing a trip-0.0.1-pyX.Y.egg file, where X.Y is your Python version.

    • Go to the Databricks dashboard -> New -> Library and upload the egg file from the dist directory.

    • Attach the library to the cluster you want to use.

  • Finally, if all you want is a record type, you can use a namedtuple without any additional steps (see the last sketch after this list):

    from collections import namedtuple
    
    Trip = namedtuple('Trip', ['id', 'duration'])
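
Here is a minimal sketch of the Gist approach, assuming the Gist consists of a single file named trip.py (the file name is a placeholder, and it determines the module name):

# trip.py -- the only file in the Gist
class Trip:
  def __init__(self, id, duration):
    self.id = id
    self.duration = duration

and in the notebook:

sc.addPyFile(raw_gist_url)  # the URL copied from the Raw button

from trip import Trip  # now importable on the driver and on every executor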
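
With the egg approach the class definition goes into trip/__init__.py, for example:

# trip/__init__.py
class Trip:
  def __init__(self, id, duration):
    self.id = id
    self.duration = duration

Once the library is attached to the cluster, the import works the same way:

from trip import Trip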
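
And a sketch of the namedtuple variant applied to the pipeline from the question (assuming id and duration are the first two fields of each line):

from collections import namedtuple

Trip = namedtuple('Trip', ['id', 'duration'])

trips_in = sc.textFile("trip_data.csv")
trips = trips_in.map(lambda l: l.split(",")).map(lambda x: Trip(x[0], float(x[1]))).cache()
trips.count()  # no PicklingError: PySpark handles namedtuple serialization out of the box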
    