2

I am pretty new at Parallelism and am looking out for a way to parallelise a text tokenisation task. The task consists of millions of records and can be tokenised using different strategies.

I wrote the code as follow, and bumped into this error: pickle.PicklingError: Could not pickle the task to send it to the workers.. I've checked out the following but no luck at finding similar solution. Could anyone suggest a solution around this code or suggest a better way of implementation? [1] Can't pickle Function [2] Python multiprocessing pickling error [3] Can't pickle <type 'instancemethod'> when using multiprocessing Pool.map()

from joblib import Parallel, delayed
from tokenizer import Predicates

class test():
    def __init__(self):

        # millions of records
        self.data = [
            ("user name", "abc@gmail.com"),
            ("user1 abc", "abc@gmail.com"),
            ("abc user1 ", "abcd@gmail.com")
        ]

    def proc(self, data, strategy):
        # unwrap different strategy for different column
        func_username, func_email = strategy
        res = []
        for uname, email in data:
            # func call to tokenise column data
            username_tok = func_username(uname)
            email_tok = func_email(uname)
            res.append((username_tok, email_tok))
        return res

    def run(self):
        # define different tokenisation strategy
        strategy = [(Predicates().tokenFingerprint, Predicates().tokenFingerprint),
                (Predicates().otherMethod, Predicates().otherMethod)]

        # assign tokenisation jobs to multiple thread.
        # NOTE that to simplify the example, self.data is not splitted here. So both threads work on the same dataset now.

        lsres_strategy = []
        for func_username, func_email in strategy:
            lsres = Parallel(n_jobs=2)(delayed(self.proc)(self.data, (func_username, func_email)))
            lsres_strategy.append(lsres)

    t = test()
    t.run()

The tokenisation strategies are defined as

class Predicates:
    def __init__(self):
        pass

    def tokenFingerprint(self, field):
        return (u''.join(sorted(field.split())))

    def otherMethod(self, field):
        return (u''.join(field.split()))
twfx
  • 1,664
  • 9
  • 29
  • 56
  • 2
    If you passed a [second `check_pickle` argument set to `True`](https://github.com/joblib/joblib/blob/35901b8fbbeeb3e696f0c1b36f863ed17bb92eb1/joblib/parallel.py#L330), you may have caught this early. I think [instance methods can't be pickled](https://bytes.com/topic/python/answers/552476-why-cant-you-pickle-instancemethods), I have seen a lot of posts about this problem in SO. What you can try here is to refactor self.proc to a function. It doesn't look to be using any class properties or attributes to warrant it being an instance method anyway. – Oluwafemi Sule Jul 29 '18 at 09:07

0 Answers0