In Scala, we would write an RDD to Redis like this:

import com.redis.RedisClient  // scala-redis client, inferred from the constructor used below

datardd.foreachPartition(iter => {
  // One client per partition, so the connection is created on the worker itself
  val r = new RedisClient("hosturl", 6379)
  iter.foreach(i => {
    val (str, it) = i
    val map = it.toMap
    r.hmset(str, map) // write each element as a Redis hash
  })
})

I tried the same thing in PySpark with `datardd.foreachPartition(storeToRedis)`, where `storeToRedis` is defined as:

import redis

def storeToRedis(x):
    r = redis.StrictRedis(host='hosturl', port=6379)
    for i in x:
        # hmset mirrors the Scala version's hash write; redis-py's set() would reject a dict
        r.hmset(i[0], dict(i[1]))

It gives me this:

ImportError: ('No module named redis', function subimport at 0x47879b0, ('redis',))

Of course, I have imported `redis`.

kamalbanga
  • Is `redis` installed on every worker? – zero323 Aug 28 '15 at 16:43
  • @zero323 Is that the way to do it? Install `redis` on every worker. – kamalbanga Aug 29 '15 at 07:33
  • Python modules used on the workers must be installed on all the workers, so he means the Python `redis` module, not a Redis DB installation. – Paul Aug 29 '15 at 10:50
  • @Paul: I understood what he meant, and that's what I'm asking: do I have to install the Python `redis` module on all the workers manually? There should be an easier shortcut, like the Scala API's `addJars` method. – kamalbanga Aug 29 '15 at 11:26
  • @kamalbanga I'm unaware of a good way. You could try to use Spark to make the workers run `pip` or `easy_install`, but unless you can limit workers to one per machine, it might not behave very well. – Paul Aug 29 '15 at 11:37
  • @Paul Doesn't the PySpark API's `addPyFile` do exactly this? – kamalbanga Aug 29 '15 at 11:39
  • @kamalbanga Yes, sort of. I think `addPyFile` is best for short, project-oriented modules, not big distributions like, say, scipy. Searching for "scipy not on spark cluster" led to [this from databricks](https://forums.databricks.com/questions/1294/no-module-named-numpy-on-spark-cluster-on-ec2.html), where they suggest installing scipy locally and copying the directory to all the workers using a script included in Spark. – Paul Aug 29 '15 at 11:44
  • Oh okay. Thanks for this link. – kamalbanga Aug 29 '15 at 12:05
  • @kamalbanga Personally I would recommend Ansible but it is probably a matter of taste. – zero323 Aug 29 '15 at 14:16
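
A quick way to settle zero323's first question (is the Python `redis` module importable on the workers?) is to probe the executors directly. A minimal sketch, assuming a live `SparkContext` named `sc`:

def probe(_):
    # Runs on the executors: report whether redis can be imported there
    try:
        import redis  # noqa: F401
        yield True
    except ImportError:
        yield False

print(sc.parallelize(range(100), sc.defaultParallelism)
        .mapPartitions(probe).collect())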

1 Answer


PySpark's `SparkContext` has an `addPyFile` method specifically for this. Zip up the `redis` module and call this method:

from pyspark import SparkContext

sc = SparkContext(appName="analyze")
sc.addPyFile("/path/to/redis.zip")  # shipped to every worker and added to its Python path
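
With the zip shipped, the worker-side function can then import the module itself. A minimal sketch, assuming the zip contains the top-level `redis` package and that `datardd` holds `(key, iterable-of-pairs)` tuples as in the question:

def storeToRedis(partition):
    import redis  # resolved on the workers via the zip shipped by addPyFile
    r = redis.StrictRedis(host='hosturl', port=6379)
    for key, values in partition:
        r.hmset(key, dict(values))

datardd.foreachPartition(storeToRedis)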
kamalbanga