I am trying to import the bitarray library (https://pypi.python.org/pypi/bitarray/0.8.1) into a SparkContext. To do this I zipped up the contents of the bitarray folder and then tried to add the zip to my Python files. However, even after I push the library to the nodes, my RDD cannot find it. Here is my code:
zip bitarray.zip bitarray-0.8.1/bitarray/*
# Check the contents of the zip file
unzip -l bitarray.zip
Archive: bitarray.zip
Length Date Time Name
--------- ---------- ----- ----
143455 2015-11-06 02:07 bitarray/_bitarray.so
4440 2015-11-06 02:06 bitarray/__init__.py
6224 2015-11-06 02:07 bitarray/__init__.pyc
68516 2015-11-06 02:06 bitarray/test_bitarray.py
78976 2015-11-06 02:07 bitarray/test_bitarray.pyc
--------- -------
301611 5 files
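For context on what I expected the zip to do: a minimal local sketch (the package name demo_pkg and all paths here are made up for illustration) showing that Python can import pure-Python code from a zip placed on sys.path. Note that zipimport cannot load compiled extension modules, and my zip contains _bitarray.so — I am not sure if that is related to the problem.

```python
import os
import sys
import tempfile
import zipfile

# Build a zip containing a pure-Python package (hypothetical name: demo_pkg)
tmp = tempfile.mkdtemp()
pkg_zip = os.path.join(tmp, "demo_pkg.zip")
with zipfile.ZipFile(pkg_zip, "w") as zf:
    zf.writestr("demo_pkg/__init__.py", "VALUE = 42\n")

# Adding the zip to sys.path lets zipimport find the package...
sys.path.insert(0, pkg_zip)
import demo_pkg
print(demo_pkg.VALUE)

# ...but zipimport cannot load C extension modules (such as _bitarray.so),
# so packages containing them generally have to be installed or extracted
# on each node rather than shipped as a zip.
```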
Then, in Spark:
import os
import sys
# Environment
import findspark
findspark.init("/home/utils/spark-1.6.0/")
import pyspark
sparkConf = pyspark.SparkConf()
sparkConf.set("spark.executor.instances", "2")
sparkConf.set("spark.executor.memory", "10g")
sparkConf.set("spark.executor.cores", "2")
sc = pyspark.SparkContext(conf = sparkConf)
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import HiveContext
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import udf
hiveContext = HiveContext(sc)
PYBLOOM_LIB = '/home/ryandevera/pybloom.zip'
sys.path.append(PYBLOOM_LIB)
sc.addPyFile(PYBLOOM_LIB)
from pybloom import BloomFilter
f = BloomFilter(capacity=1000, error_rate=0.001)
x = sc.parallelize([(1,("hello",4)),(2,("goodbye",5)),(3,("hey",6)),(4,("test",7))],2)
def bloom_filter_spark(iterator):
    for id, _ in iterator:
        f.add(id)
    yield (None, f)
x.mapPartitions(bloom_filter_spark).take(1)
This yields the error:
ImportError: pybloom requires bitarray >= 0.3.4
I am not sure where I am going wrong. Any help would be greatly appreciated!
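In case it helps with a diagnosis, here is a small self-contained sketch I used to sanity-check the zip layout (the sample zip built here is hypothetical, mirroring the `unzip -l` listing above): for the zip to be importable, the package directory has to sit at the root of the archive.

```python
import os
import tempfile
import zipfile

def top_level_names(zip_path):
    """Return the top-level entries in a zip, to verify the package layout."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({name.split("/")[0] for name in zf.namelist()})

# Build a sample zip mirroring the layout shown by `unzip -l` above
tmp = tempfile.mkdtemp()
sample = os.path.join(tmp, "bitarray.zip")
with zipfile.ZipFile(sample, "w") as zf:
    zf.writestr("bitarray/__init__.py", "")
    zf.writestr("bitarray/test_bitarray.py", "")

# A correct layout shows the package directory at the root;
# something like ['bitarray-0.8.1'] would mean it is nested one level too deep.
print(top_level_names(sample))
```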