I am downloading a lot of files from a website and want the downloads to run in parallel because the files are large. Unfortunately, I can't share the website, because accessing the files requires a username and password that I can't give out. My code is below. I know it can't be run without the website and my credentials, but I am 99% sure I am not allowed to share that information.
import os
from os import makedirs
from os.path import join
import requests
from multiprocessing import Process

dataset = "dataset_name"
################################
def down_file(dspath, file, savepath, ret):
    webfilename = dspath+file
    file_base = os.path.basename(file)
    file = join(savepath, file_base)
    print('...Downloading',file_base)
    req = requests.get(webfilename, cookies = ret.cookies, allow_redirects=True, stream=True)
    filesize = int(req.headers['Content-length'])
    with open(file, 'wb') as outfile:
        chunk_size=1048576
        for chunk in req.iter_content(chunk_size=chunk_size):
            outfile.write(chunk)
    return None
################################
##Download files
def download_files(filelist, c_DateNow):
    ## Authenticate
    url = 'url'
    values = {'email' : 'email', 'passwd' : "password", 'action' : 'login'}
    ret = requests.post(url, data=values)
    ## Path to files
    dspath = 'datasetwebpath'
    savepath = join(path_script, dataset, c_DateNow)
    makedirs(savepath, exist_ok = True)
    #"""
    processes = [Process(target=down_file, args=(dspath, file, savepath, ret)) for file in filelist]
    print(["dspath, %s, savepath, ret\n"%(file) for file in filelist])
    # kick them off
    for process in processes:
        print("\n", process)
        process.start()
    # now wait for them to finish
    for process in processes:
        process.join()
    #"""
    ####### This works and it's what i want to parallelize
    """
    ##Download files
    for file in filelist:
        down_file(dspath, file, savepath, ret)
    #"""
################################
def main(c_DateNow, c_DateIni, c_DateFin):
    ## Other code
    files=["list of web file addresses"]
    print(" ...Files being downloaded\n ", "\n ".join(files), "\n")
    ## Download files
    download_files(files, c_DateNow)
I want to download 25 files. When I run the code, every print statement that already ran earlier in the program is printed again, even though the Process calls are nowhere near them. I also keep getting the following error:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
I googled the error and don't know how to fix it. Does it have to do with there not being enough cores? Is there a way to limit the number of processes to the number of cores I have available? Or is it something else entirely?
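On the core-count part of the question, one possible way to cap the number of simultaneous processes is a `multiprocessing.Pool` sized from `os.cpu_count()`. The following is only a minimal sketch under that assumption: `worker_count`, `fake_download` and `run_capped` are hypothetical names that are not in the code above, and `fake_download` merely stands in for `down_file` so that the example runs without the website.

```python
import os
from multiprocessing import Pool


def worker_count(n_tasks, max_workers=None):
    """Return how many worker processes to start (always at least 1)."""
    # Fall back to 1 if the core count cannot be determined.
    cores = max_workers or os.cpu_count() or 1
    return max(1, min(n_tasks, cores))


def fake_download(name):
    # Hypothetical stand-in for down_file(); it only returns a label.
    return "done: " + name


def run_capped(names):
    # The pool never runs more than worker_count(...) processes at once
    # and hands each worker the next name as soon as it becomes free.
    with Pool(processes=worker_count(len(names))) as pool:
        return pool.map(fake_download, names)


if __name__ == "__main__":
    # 25 placeholder names, matching the 25 files mentioned above.
    print(run_capped(["file_%02d" % i for i in range(25)]))
```

A pool like this limits concurrency, but it would probably not remove the bootstrapping error by itself, since that error concerns where the processes are started rather than how many there are.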
In a question here, I read that the Process has to be started from within the __main__ block, but this code is a module that gets imported by another script, so when I run it I run it as
import this_code
import another1_code
import another2_code
#Step1
another1_code.main()
#Step2
c_DateNow, c_DateIni, c_DateFin = another2_code.main()
#Step3
this_code.main(c_DateNow, c_DateIni, c_DateFin)
#step4
## More code
So I need the Process calls to live inside a function and not in __main__.
I appreciate any help or suggestions on how to correctly parallelize the above code in a way that allows me to use it as a module in another code.
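For reference, here is a minimal single-file sketch of how the "proper idiom" from the error message might coexist with processes that are created inside a function. It rests on the assumption that only the script launched directly needs the `if __name__ == "__main__":` guard, while an imported module can keep its `Process` calls inside ordinary functions. The names `work`, `build_processes` and the sample items are hypothetical; `work` stands in for `down_file`.

```python
from multiprocessing import Process


def work(item):
    # Hypothetical stand-in for down_file().
    print("working on", item)


def build_processes(items):
    # Creating Process objects does not start anything yet.
    return [Process(target=work, args=(item,)) for item in items]


def main(items):
    # The processes are started inside a function, as they would be
    # in a module that another script imports.
    processes = build_processes(items)
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    return [process.exitcode for process in processes]


# Only the script that is launched directly carries the guard. Under the
# "spawn" start method each child re-imports this file under a different
# module name, so the call below is not executed again in the children.
if __name__ == "__main__":
    print(main(["a", "b", "c"]))
```

In the layout described above, the guard would presumably wrap the Step1 to Step4 calls in the top-level script, and `this_code.main` would stay unchanged.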