Dask appears to be rerunning already executed code when the code reaches the dask compute command. In the below code, an empty directory called 'test' is created before the function with dask is called, but when the function is executed it appears in the log that the 'test' directory is trying to be created again. Can someone help me understand how dask works and why this would be the case?
Here is the test code:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[5,5,5,5]})
import os
os.mkdir('test')
def B(df):
df['col3'] = df['col1'] + 100
return df
def A(df):
from dask import delayed, compute
import numpy as np
from dask.distributed import Client, LocalCluster
import multiprocessing.popen_spawn_win32
if __name__ == '__main__':
with LocalCluster(n_workers = 2) as cluster, Client(cluster) as client:
results_dfs = []
df_split = np.array_split(df, 2)
for split in df_split:
results_dfs.append(delayed(B)(split))
result = delayed(pd.concat)(results_dfs)
result = result.compute()
return result
result = A(df)
print(result)
Here is the log for the code, the FileExistsErrors reference the 'test' directory already existing, and the last 7 lines of the log repeat indefinitely until the program is interrupted.
C:\User\Desktop\Code>python main.py
Traceback (most recent call last):
File "<string>", line 1, in <module>
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 116, in spawn_main
File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 125, in _main
exitcode = _main(fd, parent_sentinel)
File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 125, in _main
prepare(preparation_data)
prepare(preparation_data)
File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 236, in prepare
File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
_fixup_main_from_path(data['init_main_from_path'])
File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
File "C:\User\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "C:\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 268, in run_path
main_content = runpy.run_path(main_path,
File "C:\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 268, in run_path
return _run_module_code(code, init_globals, run_name,
File "C:\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 97, in _run_module_code
return _run_module_code(code, init_globals, run_name,
File "C:\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "C:\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
_run_code(code, mod_globals, init_globals,
File "C:\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\User\Desktop\Code\main.py", line 5, in <module>
exec(code, run_globals)
File "C:\User\Desktop\Code\main.py", line 5, in <module>
os.mkdir('test')
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'test'
os.mkdir('test')
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'test'
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 2>>
Traceback (most recent call last):
File "C:\User\AppData\Local\Programs\Python\Python39\lib\site-packages\tornado\ioloop.py", line 905, in _run
return self.callback()
File "C:\User\AppData\Local\Programs\Python\Python39\lib\site-packages\distributed\nanny.py", line 414, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
I'm running windows 10, python 3.9, and here are the environment packagespython:
Package Version
----------------- --------
asgiref 3.3.0
astroid 2.4.2
bokeh 2.2.3
click 7.1.2
cloudpickle 1.6.0
colorama 0.4.4
dask 2021.2.0
distributed 2021.2.0
Django 3.1.4
et-xmlfile 1.0.1
fsspec 0.8.5
HeapDict 1.0.1
isort 5.6.4
jdcal 1.4.1
Jinja2 2.11.2
lazy-object-proxy 1.4.3
locket 0.2.0
MarkupSafe 1.1.1
mccabe 0.6.1
msgpack 1.0.2
numpy 1.19.3
openpyxl 3.0.5
packaging 20.8
pandas 1.1.5
partd 1.1.0
Pillow 8.0.1
pip 21.0.1
psutil 5.8.0
pylint 2.6.0
pyparsing 2.4.7
python-dateutil 2.8.1
pytz 2020.4
PyYAML 5.3.1
reportlab 3.5.59
setuptools 49.2.1
six 1.15.0
sortedcontainers 2.3.0
sqlparse 0.4.1
tblib 1.7.0
toml 0.10.2
toolz 0.11.1
tornado 6.1
typing-extensions 3.7.4.3
wrapt 1.12.1
xlrd 2.0.1
zict 2.0.0