Dask appears to rerun code that has already executed once it reaches the Dask `compute` call. In the code below, an empty directory called 'test' is created before the function that uses Dask is called, but when that function runs, the log shows another attempt to create the 'test' directory. Can someone help me understand how Dask works and why this happens?

Here is the test code:

    import pandas as pd
    df = pd.DataFrame({'col1':[1,2,3,4],'col2':[5,5,5,5]})

    import os
    os.mkdir('test')

    def B(df):
        df['col3'] = df['col1'] + 100
        return df

    def A(df):
        from dask import delayed, compute
        import numpy as np
        from dask.distributed import Client, LocalCluster
        import multiprocessing.popen_spawn_win32

        if __name__ == '__main__':
            with LocalCluster(n_workers = 2) as cluster, Client(cluster) as client:
                results_dfs = []
                df_split = np.array_split(df, 2)

                for split in df_split:
                    results_dfs.append(delayed(B)(split))

                result = delayed(pd.concat)(results_dfs)
                result = result.compute()

            return result
        
    result = A(df)
    print(result)

Here is the log for the code. The FileExistsError entries refer to the 'test' directory already existing, and the last 7 lines of the log repeat indefinitely until the program is interrupted.

    C:\User\Desktop\Code>python main.py
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 116, in spawn_main
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 116, in spawn_main
        exitcode = _main(fd, parent_sentinel)
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 125, in _main
        exitcode = _main(fd, parent_sentinel)
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 125, in _main
        prepare(preparation_data)
        prepare(preparation_data)
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 236, in prepare
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 236, in prepare
        _fixup_main_from_path(data['init_main_from_path'])
        _fixup_main_from_path(data['init_main_from_path'])
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
        main_content = runpy.run_path(main_path,
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 268, in run_path
        main_content = runpy.run_path(main_path,
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 268, in run_path
        return _run_module_code(code, init_globals, run_name,
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 97, in _run_module_code
        return _run_module_code(code, init_globals, run_name,
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 97, in _run_module_code
        _run_code(code, mod_globals, init_globals,
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
        _run_code(code, mod_globals, init_globals,
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "C:\User\Desktop\Code\main.py", line 5, in <module>
        exec(code, run_globals)
      File "C:\User\Desktop\Code\main.py", line 5, in <module>
        os.mkdir('test')
    FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'test'
        os.mkdir('test')
    FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'test'
    tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 2>>
    Traceback (most recent call last):
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\site-packages\tornado\ioloop.py", line 905, in _run
        return self.callback()
      File "C:\User\AppData\Local\Programs\Python\Python39\lib\site-packages\distributed\nanny.py", line 414, in memory_monitor
        process = self.process.process
    AttributeError: 'NoneType' object has no attribute 'process'

I'm running Windows 10 and Python 3.9, and here are the packages in the environment:

    Package           Version
    ----------------- --------
    asgiref           3.3.0
    astroid           2.4.2
    bokeh             2.2.3
    click             7.1.2
    cloudpickle       1.6.0
    colorama          0.4.4
    dask              2021.2.0
    distributed       2021.2.0
    Django            3.1.4
    et-xmlfile        1.0.1
    fsspec            0.8.5
    HeapDict          1.0.1
    isort             5.6.4
    jdcal             1.4.1
    Jinja2            2.11.2
    lazy-object-proxy 1.4.3
    locket            0.2.0
    MarkupSafe        1.1.1
    mccabe            0.6.1
    msgpack           1.0.2
    numpy             1.19.3
    openpyxl          3.0.5
    packaging         20.8
    pandas            1.1.5
    partd             1.1.0
    Pillow            8.0.1
    pip               21.0.1
    psutil            5.8.0
    pylint            2.6.0
    pyparsing         2.4.7
    python-dateutil   2.8.1
    pytz              2020.4
    PyYAML            5.3.1
    reportlab         3.5.59
    setuptools        49.2.1
    six               1.15.0
    sortedcontainers  2.3.0
    sqlparse          0.4.1
    tblib             1.7.0
    toml              0.10.2
    toolz             0.11.1
    tornado           6.1
    typing-extensions 3.7.4.3
    wrapt             1.12.1
    xlrd              2.0.1
    zict              2.0.0
    I don't have sufficient knowledge to express this properly, but you will find some relevant information here: https://stackoverflow.com/questions/43545179/why-does-importing-module-in-main-not-allow-multiprocessig-to-use-module Basically, the code will be 'evaluated' multiple times before execution (you can see that with strategically placed print statements). – SultanOrazbayev Mar 20 '21 at 12:22
  • Thanks, this link helped. If I understand correctly, on Windows the `__main__` module is imported and executed in each child process, but under a name other than `__main__` in the child process. So moving the `if __name__ == '__main__':` guard from inside function A to the top level of the program prevents the module-level code from running again in the child processes (a restructured sketch is shown after these comments). The code runs as expected after making that change. – mpLoNsTa Mar 21 '21 at 01:39
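
For reference, here is a minimal sketch of the restructuring described in the comment above. It is illustrative only and assumes the sole change is hoisting the `if __name__ == '__main__':` guard to module level, so that side-effecting top-level code such as `os.mkdir('test')` runs only in the parent process and not in the spawned worker processes, which re-import the main module under a different `__name__`:

    import os
    import numpy as np
    import pandas as pd
    from dask import delayed
    from dask.distributed import Client, LocalCluster

    def B(df):
        df['col3'] = df['col1'] + 100
        return df

    def A(df):
        with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
            # Split the frame, apply B to each piece lazily, then concat lazily
            results_dfs = [delayed(B)(split) for split in np.array_split(df, 2)]
            result = delayed(pd.concat)(results_dfs)
            return result.compute()

    # The guard now sits at module level, so everything with side effects
    # (creating 'test', building the DataFrame, starting the cluster) runs
    # only in the parent process. The worker processes spawned on Windows
    # re-import this module under a different __name__ and skip this block.
    if __name__ == '__main__':
        df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 5, 5, 5]})
        os.mkdir('test')
        result = A(df)
        print(result)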
