
I'm using p_map() from the p_tqdm library to parallelize some code, and I run into this dill-related error that I can't figure out.

Traceback (most recent call last):
  File "C:\Users\uid\AppData\Local\Programs\Python\Python38\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\Users\uid\AppData\Local\Programs\Python\Python38\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\uid\AppData\Local\Programs\Python\Python38\lib\site-packages\multiprocess\pool.py", line 576, in _handle_results
    task = get()
  File "C:\Users\uid\AppData\Local\Programs\Python\Python38\lib\site-packages\multiprocess\connection.py", line 254, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "C:\Users\uid\AppData\Local\Programs\Python\Python38\lib\site-packages\dill\_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "C:\Users\uid\AppData\Local\Programs\Python\Python38\lib\site-packages\dill\_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "C:\Users\uid\AppData\Local\Programs\Python\Python38\lib\site-packages\dill\_dill.py", line 473, in load
    obj = StockUnpickler.load(self)
TypeError: __init__() takes 1 positional argument but 2 were given
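One way a TypeError with this exact wording can arise while unpickling (an assumption about a possible cause, not something confirmed for the code in this question) is an exception class whose __init__ takes no arguments but passes a message to super().__init__(). Pickle records the class together with the exception's args and calls the class with those args on load. The sketch below replays that step by hand; WorkerError and replay_reduce are hypothetical names used only for illustration.

```python
class WorkerError(Exception):
    """Hypothetical exception whose __init__ accepts no arguments."""

    def __init__(self):
        # The message ends up in self.args, which pickle stores and replays.
        super().__init__("something failed in the worker")


def replay_reduce(exc):
    """Rebuild an exception the way unpickling does: cls(*args).

    Returns the TypeError text if the rebuild fails, otherwise None.
    """
    reduced = exc.__reduce__()
    cls, args = reduced[0], reduced[1]
    try:
        cls(*args)
    except TypeError as err:
        return str(err)
    return None


# Prints a message ending in
# "takes 1 positional argument but 2 were given".
print(replay_reduce(WorkerError()))
```

If something like this is happening, the error would come from an exception raised inside a worker process rather than from the True return value itself.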

My code is structured as:

import pickle
from p_tqdm import p_map

def my_func(data_fp):
    # Load one pickled data file; the context manager closes the handle.
    with open(data_fp, 'rb') as f:
        data = pickle.load(f)

    # Do basic stuff to the data here.

    return True

class MyClass:
    def process_data(self):
        # Do prep stuff here....get a list of filepaths we will need to load.
        data_fp_list = []  # placeholder: filled in by the prep step above

        ret = p_map(my_func, data_fp_list, num_cpus=0.75)

        return True

(process_data() is called from within an if __name__ == '__main__': multiprocessing.freeze_support() block in another file)
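To check whether my_func itself raises inside a worker, one option is to wrap it so that any exception comes back as plain text, which always survives pickling. This is a minimal sketch under that assumption; safe_call and failing_func are hypothetical helpers, not part of p_tqdm or of the original code.

```python
import traceback


def safe_call(func, arg):
    """Run func(arg); return (True, result) or (False, traceback text)."""
    try:
        return True, func(arg)
    except Exception:
        # A formatted traceback is a plain str, so it pickles safely.
        return False, traceback.format_exc()


def failing_func(data_fp):
    # Hypothetical stand-in for a my_func that fails on its input.
    raise ValueError(f"could not process {data_fp}")


ok, payload = safe_call(failing_func, "example.pkl")
print(ok)
print(payload)
```

With the code above, this could be passed to the pool as p_map(functools.partial(safe_call, my_func), data_fp_list, num_cpus=0.75), after which any entry whose first element is False carries the worker-side traceback.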

This fails on both a Windows 10 machine and a RHEL7 machine, each running Python 3.8.2. My pip list gives:

dill            0.3.2
filelock        3.0.12
future          0.18.2
idna            2.9
joblib          0.14.1
kiwisolver      1.2.0
matplotlib      3.2.1
multiprocess    0.70.10
numpy           1.18.3
p-tqdm          1.3.3
packaging       20.4
pandas          1.0.3
pathos          0.2.6
Pillow          7.1.1
pip             20.1.1
pox             0.2.8
ppft            1.6.6.2
pyparsing       2.4.7
pytesseract     0.3.4
python-dateutil 2.8.1
pytz            2020.1
regex           2020.5.7
requests        2.23.0
sacremoses      0.0.43
scikit-learn    0.23.1
scipy           1.4.1
sentencepiece   0.1.90
setuptools      41.2.0
six             1.14.0
threadpoolctl   2.1.0
tokenizers      0.8.1rc1
torch           1.5.0+cpu
torchvision     0.6.0+cpu
tqdm            4.45.0
transformers    3.0.2
urllib3         1.25.9
Wand            0.5.9

I've seen similar questions being asked about pathos as it relates to dill/pickle but I can't quite get my head around it. Am I in the same situation as this question?

Update:

This also happens with the map function of the built-in multiprocessing library, but not when I use threads instead of processes: concurrent.futures.ThreadPoolExecutor.map() worked like a charm. I don't know whether this points to an issue with the data I'm pickling.
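For reference, the thread-based workaround can be sketched roughly as follows. Threads share memory with the caller, so return values are never pickled, which is consistent with the error disappearing. The run_with_threads and demo helpers, the max_workers value, and the sample files are assumptions added only to make the sketch runnable on its own.

```python
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def my_func(data_fp):
    """Load one pickled file and report success, as in the question."""
    with open(data_fp, "rb") as fh:
        data = pickle.load(fh)
    # Do basic stuff to the data here.
    return True


def run_with_threads(data_fp_list, max_workers=4):
    """Map my_func over the file paths using threads, not processes."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(my_func, data_fp_list))


def demo():
    # Write a few throwaway pickle files so the sketch is self-contained.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            fp = Path(tmp) / f"sample_{i}.pkl"
            with open(fp, "wb") as fh:
                pickle.dump({"index": i}, fh)
            paths.append(fp)
        return run_with_threads(paths)


print(demo())  # [True, True, True]
```

Note that this sidesteps the pickling step rather than explaining it, and CPU-bound work in my_func would still be limited by the GIL.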

Antoine Zambelli
  • For anyone else in the same boat, I managed to bypass this issue by using `concurrent.futures.ThreadPoolExecutor.map`. Threads seem to work for this pickling problem. Still no clue why it's happening or what an actual fix would be. – Antoine Zambelli Aug 25 '20 at 22:17
