I am fairly new to parallelism and am looking for a way to parallelise a text-tokenisation task. The task involves millions of records, which can be tokenised using different strategies. I wrote the code below and ran into this error: pickle.PicklingError: Could not pickle the task to send it to the workers.
I've checked the following questions but couldn't find a similar solution. Could anyone suggest a fix for this code, or a better way to implement it?
[1] Can't pickle Function
[2] Python multiprocessing pickling error
[3] Can't pickle <type 'instancemethod'> when using multiprocessing Pool.map()
from joblib import Parallel, delayed
from tokenizer import Predicates
class test():
    def __init__(self):
        # millions of records in the real data; small sample here
        self.data = [
            ("user name", "abc@gmail.com"),
            ("user1 abc", "abc@gmail.com"),
            ("abc user1 ", "abcd@gmail.com")
        ]

    def proc(self, data, strategy):
        # unwrap the strategy: one tokeniser per column
        func_username, func_email = strategy
        res = []
        for uname, email in data:
            # tokenise each column with its own function
            username_tok = func_username(uname)
            email_tok = func_email(email)
            res.append((username_tok, email_tok))
        return res

    def run(self):
        # define different tokenisation strategies
        strategy = [(Predicates().tokenFingerprint, Predicates().tokenFingerprint),
                    (Predicates().otherMethod, Predicates().otherMethod)]
        # assign tokenisation jobs to multiple workers.
        # NOTE: to keep the example simple, self.data is not split here,
        # so both workers operate on the same dataset.
        lsres_strategy = []
        for func_username, func_email in strategy:
            lsres = Parallel(n_jobs=2)([delayed(self.proc)(self.data, (func_username, func_email))])
            lsres_strategy.append(lsres)

t = test()
t.run()
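While debugging, I also checked which callables pickle will accept at all, since the error points at pickling the task. A minimal standalone check (stdlib only, the function name is just illustrative):

```python
import pickle

def tokenFingerprint(field):
    # module-level function: pickled by reference to its module path
    return ''.join(sorted(field.split()))

# a top-level function pickles fine, so workers can receive it
payload = pickle.dumps(tokenFingerprint)

# a lambda (or any locally defined function) does not pickle
try:
    pickle.dumps(lambda field: field.split())
    pickling_failed = False
except (pickle.PicklingError, AttributeError):
    pickling_failed = True
```

This suggests the problem is where the task callable is defined, not what it computes.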
The tokenisation strategies are defined as:
class Predicates:
    def __init__(self):
        pass

    def tokenFingerprint(self, field):
        return (u''.join(sorted(field.split())))

    def otherMethod(self, field):
        return (u''.join(field.split()))
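To make the expected behaviour concrete, this is what the two strategies produce on one of the sample records (a standalone sketch of the same class):

```python
class Predicates:
    def tokenFingerprint(self, field):
        # sort the whitespace-separated tokens, then concatenate them
        return u''.join(sorted(field.split()))

    def otherMethod(self, field):
        # concatenate the tokens in their original order
        return u''.join(field.split())

p = Predicates()
print(p.tokenFingerprint("user1 abc"))  # -> abcuser1
print(p.otherMethod("user1 abc"))       # -> user1abc
```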