In Scala, we would write an RDD to Redis like this:

import com.redis.RedisClient  // scala-redis client, inferred from the constructor used below

datardd.foreachPartition(iter => {
  // One client per partition, so the connection is created on the worker itself
  val r = new RedisClient("hosturl", 6379)
  iter.foreach(i => {
    val (str, it) = i
    val map = it.toMap
    r.hmset(str, map) // write each element as a Redis hash
  })
})

I tried the same thing in PySpark with `datardd.foreachPartition(storeToRedis)`, where `storeToRedis` is defined as:

import redis

def storeToRedis(x):
    r = redis.StrictRedis(host='hosturl', port=6379)
    for i in x:
        # hmset mirrors the Scala version's hash write; redis-py's set() would reject a dict
        r.hmset(i[0], dict(i[1]))

It gives me this:

ImportError: ('No module named redis', function subimport at 0x47879b0, ('redis',))

Of course, I have imported `redis`.

kamalbanga
  • Is `redis` installed on every worker? – zero323 Aug 28 '15 at 16:43
  • @zero323 Is that the way to do it? Install `redis` on every worker. – kamalbanga Aug 29 '15 at 07:33
  • Python modules used on the workers must be installed on all the workers, so he means the Python `redis` module, not a Redis DB installation. – Paul Aug 29 '15 at 10:50
  • @Paul: I understood what he meant, and that's what I'm asking: do I have to install the Python `redis` module on all the workers manually? There should be an easier shortcut, like the Scala API's `addJars` method. – kamalbanga Aug 29 '15 at 11:26
  • @kamalbanga I'm unaware of a good way. You could try to use Spark to make the workers run `pip` or `easy_install`, but unless you can limit workers to one per machine, it might not behave very well. – Paul Aug 29 '15 at 11:37
  • @Paul Doesn't the PySpark API's `addPyFile` do exactly this? – kamalbanga Aug 29 '15 at 11:39
  • @kamalbanga Yes, sort of. I think `addPyFile` is best for short, project-oriented modules, not big distributions like, say, scipy. Searching for "scipy not on spark cluster" led to [this from databricks](https://forums.databricks.com/questions/1294/no-module-named-numpy-on-spark-cluster-on-ec2.html), where they suggest installing scipy locally and copying the directory to all the workers using a script included in Spark. – Paul Aug 29 '15 at 11:44
  • Oh okay. Thanks for this link. – kamalbanga Aug 29 '15 at 12:05
  • @kamalbanga Personally I would recommend Ansible but it is probably a matter of taste. – zero323 Aug 29 '15 at 14:16
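
A quick way to settle zero323's first question (is the Python `redis` module importable on the workers?) is to probe the executors directly. A minimal sketch, assuming a live `SparkContext` named `sc`:

def probe(_):
    # Runs on the executors: report whether redis can be imported there
    try:
        import redis  # noqa: F401
        yield True
    except ImportError:
        yield False

print(sc.parallelize(range(100), sc.defaultParallelism)
        .mapPartitions(probe).collect())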

1 Answer


PySpark's `SparkContext` has an `addPyFile` method specifically for this. Zip up the `redis` module and call this method:

from pyspark import SparkContext

sc = SparkContext(appName="analyze")
sc.addPyFile("/path/to/redis.zip")  # shipped to every worker and added to its Python path
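
With the zip shipped, the worker-side function can then import the module itself. A minimal sketch, assuming the zip contains the top-level `redis` package and that `datardd` holds `(key, iterable-of-pairs)` tuples as in the question:

def storeToRedis(partition):
    import redis  # resolved on the workers via the zip shipped by addPyFile
    r = redis.StrictRedis(host='hosturl', port=6379)
    for key, values in partition:
        r.hmset(key, dict(values))

datardd.foreachPartition(storeToRedis)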
kamalbanga