I am trying to import the bitarray library (https://pypi.python.org/pypi/bitarray/0.8.1) into a SparkContext.

To do this I have zipped up the contents of the bitarray folder and then tried to add it to my Python files. However, even after I push the library to the nodes, my RDD cannot find it. Here is my code:

zip bitarray.zip bitarray-0.8.1/bitarray/*

# Check the contents of the zip file

unzip -l bitarray.zip
Archive:  bitarray.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
   143455  2015-11-06 02:07   bitarray/_bitarray.so
     4440  2015-11-06 02:06   bitarray/__init__.py
     6224  2015-11-06 02:07   bitarray/__init__.pyc
    68516  2015-11-06 02:06   bitarray/test_bitarray.py
    78976  2015-11-06 02:07   bitarray/test_bitarray.pyc
---------                     -------
   301611                     5 files
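A quick way to double-check that the archive keeps the package importable is to build and inspect it from Python itself. This is a stdlib-only sketch using a stand-in package tree (the directory names mirror the ones above and are assumptions); the point is that `bitarray/` must sit at the top level of the archive for a zip on `sys.path` to be importable:

```python
import os
import tempfile
import zipfile

# Build a small stand-in package tree (mirrors bitarray-0.8.1/bitarray/)
root = tempfile.mkdtemp()
pkg = os.path.join(root, "bitarray")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as fh:
    fh.write("version = '0.8.1'\n")

# Zip it so that 'bitarray/' is at the top level of the archive --
# that is the layout sc.addPyFile / sys.path expect for a zip import.
zip_path = os.path.join(root, "bitarray.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    for name in os.listdir(pkg):
        zf.write(os.path.join(pkg, name), arcname="bitarray/" + name)

print(zipfile.ZipFile(zip_path).namelist())  # ['bitarray/__init__.py']
```

Note that even with the right layout, a zip will not help with `_bitarray.so`: compiled C extensions cannot be imported from inside a zip archive, which is part of why the zip approach fails here.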

Then in Spark:

import os
import sys

# Environment
import findspark
findspark.init("/home/utils/spark-1.6.0/")

import pyspark
sparkConf = pyspark.SparkConf()

sparkConf.set("spark.executor.instances", "2") 
sparkConf.set("spark.executor.memory", "10g")
sparkConf.set("spark.executor.cores", "2")

sc = pyspark.SparkContext(conf = sparkConf)

from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import HiveContext
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import udf

hiveContext = HiveContext(sc)

PYBLOOM_LIB = '/home/ryandevera/pybloom.zip'
sys.path.append(PYBLOOM_LIB)
sc.addPyFile(PYBLOOM_LIB)

from pybloom import BloomFilter
f = BloomFilter(capacity=1000, error_rate=0.001)
x = sc.parallelize([(1,("hello",4)),(2,("goodbye",5)),(3,("hey",6)),(4,("test",7))],2)


def bloom_filter_spark(iterator):
    for id,_ in iterator:
        f.add(id)
    yield (None, f)

x.mapPartitions(bloom_filter_spark).take(1)

This yields the error:

ImportError: pybloom requires bitarray >= 0.3.4

I am not sure where I am going wrong. Any help would be greatly appreciated!

RDizzl3

1 Answer

Probably the simplest thing you can do is to create and distribute egg files. Assuming you've downloaded and unpacked the source files from PyPI and set the PYBLOOM_SOURCE_DIR and BITARRAY_SOURCE_DIR variables:

cd $PYBLOOM_SOURCE_DIR
python setup.py bdist_egg
cd $BITARRAY_SOURCE_DIR
python setup.py bdist_egg

In PySpark add:

from itertools import chain
import os
import glob

eggs = chain.from_iterable([
    glob.glob(os.path.join(os.environ[x], "dist/*"))
    for x in ["PYBLOOM_SOURCE_DIR", "BITARRAY_SOURCE_DIR"]
])

for egg in eggs:
    sc.addPyFile(egg)

The problem is that the BloomFilter object cannot be properly serialized, so if you want to use it you'll have to either patch it or extract the underlying bitarrays and pass those around:

def buildFilter(iterator):
    bf = BloomFilter(capacity=1000, error_rate=0.001)
    for x in iterator:
        bf.add(x)
    return [bf.bitarray]

rdd = sc.parallelize(range(100))
rdd.mapPartitions(buildFilter).reduce(lambda x, y: x | y)
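The reduce step works because the union of two Bloom filters over the same parameters is just a bitwise OR of their bit vectors. A minimal stdlib illustration of that idea, using plain Python integers as bit vectors (no Spark or pybloom required; the hash scheme here is a toy assumption, not pybloom's):

```python
def toy_bloom_bits(items, m=64, k=3):
    """Return an m-bit integer with k bits set per item (toy hash scheme)."""
    bits = 0
    for item in items:
        for i in range(k):
            bits |= 1 << (hash((item, i)) % m)
    return bits

# Two "partitions" build their bit vectors independently...
part1 = toy_bloom_bits([1, 2, 3])
part2 = toy_bloom_bits([4, 5, 6])

# ...and merging them is a plain bitwise OR, exactly what the
# reduce(lambda x, y: x | y) above does with the bitarray objects.
merged = part1 | part2
assert merged == toy_bloom_bits([1, 2, 3, 4, 5, 6])
```

Because OR is associative and commutative, it does not matter how Spark groups the partial results during the reduce.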
zero323
  • Thank you! I am going to try this out. – RDizzl3 Jan 23 '16 at 20:25
  • I have followed the code you provided. I think this would be a good solution. However, I am running into trouble. It seems that access to `/home/.python-eggs` is restricted. I have set `os.environ['PYTHON_EGG_CACHE']` to a different temporary folder that I do have access to, but I think the issue is on the nodes. Do you know how to fix this? I will post the error. – RDizzl3 Jan 23 '16 at 23:11
  • **ExtractionError: Can't extract file(s) to egg cache The following error occurred while trying to extract file(s) to the Python egg cache: [Errno 13] Permission denied: '/home/.python-eggs' The Python egg cache directory is currently set to: /home/.python-eggs Perhaps your account does not have write access to this directory? You can change the cache directory by setting the PYTHON_EGG_CACHE environment variable to point to an accessible directory.** – RDizzl3 Jan 23 '16 at 23:11
  • `/home/.python-eggs` is kind of troubling. It should be in the home directory of the current user, not in the `/home/` root. Anyway, if you want to use `PYTHON_EGG_CACHE` you have to set it on each worker. You can try to execute `os.environ['PYTHON_EGG_CACHE'] = some_dir` inside the `map` / `mapPartitions` function before you import `bitarray` / `pybloom`. Another trick you can try is to actually install the required libraries at runtime in user space (see http://stackoverflow.com/a/34385088/1560062). – zero323 Jan 24 '16 at 10:55
  • Thank you for your help! – RDizzl3 Jan 24 '16 at 17:54
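The per-worker trick from the comments above can be sketched without a cluster. The sketch below is pure Python and runnable locally; on a real cluster the function would be passed to `mapPartitions` so the environment variable is set on each executor before the egg-backed import runs (the cache directory name and the placeholder body are assumptions):

```python
import os
import tempfile

def build_filter_partition(iterator):
    # Point the egg cache at a writable directory *before* importing
    # anything that lives in an egg -- on a cluster this runs per executor.
    os.environ["PYTHON_EGG_CACHE"] = tempfile.mkdtemp(prefix="egg-cache-")
    # from pybloom import BloomFilter  # import only after the env var is set
    for x in iterator:
        yield x  # placeholder for the real filter-building logic

# Locally, the function behaves like an identity pass and sets the cache dir:
out = list(build_filter_partition(iter(range(3))))
print(out)  # [0, 1, 2]
```

Setting the variable inside the function (rather than on the driver) matters because driver-side environment changes are not propagated to worker processes.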