I am downloading a lot of files from a website and want the downloads to run in parallel because the files are heavy. Unfortunately I can't really share the website, because accessing the files requires a username and password that I can't share. The code below is my code; I know it can't really be run without the website, my username, and my password, but I am 99% sure I am not allowed to share that information.

import os
from os import makedirs
from os.path import join
import requests
from multiprocessing import Process

dataset="dataset_name"

################################
def down_file(dspath, file, savepath, ret):
    # build the remote URL and the local destination path
    webfilename = dspath+file
    file_base = os.path.basename(file)
    file = join(savepath, file_base)
    print('...Downloading', file_base)

    # stream the download, reusing the cookies from the authenticated login
    req = requests.get(webfilename, cookies = ret.cookies, allow_redirects=True, stream=True)
    filesize = int(req.headers['Content-length'])  # total size from the headers (currently unused)
    with open(file, 'wb') as outfile:
        chunk_size=1048576  # 1 MiB per chunk
        for chunk in req.iter_content(chunk_size=chunk_size):
            outfile.write(chunk)

    return None

################################
##Download files
def download_files(filelist, c_DateNow):
    ## Authenticate (the URL and credentials below are placeholders; I can't share the real ones)
    url = 'url'
    values = {'email' : 'email', 'passwd' : "password", 'action' : 'login'}
    ret = requests.post(url, data=values)

    ## Path to files
    dspath = 'datasetwebpath'
    
    savepath = join(path_script, dataset, c_DateNow)  # path_script is defined elsewhere in the real module
    makedirs(savepath, exist_ok = True)

    #"""
    processes = [Process(target=down_file, args=(dspath, file, savepath, ret)) for file in filelist]
    print(["dspath, %s, savepath, ret\n"%(file) for file in filelist])
    
    # kick them off 
    for process in processes:
        print("\n", process)
        process.start()

    # now wait for them to finish
    for process in processes:
        process.join()

    #"""

    ####### This works and it's what I want to parallelize
    """
    ##Download files
    for file in filelist:
        down_file(dspath, file, savepath, ret)
    #"""

################################
def main(c_DateNow, c_DateIni, c_DateFin):    
    ## Other code
    files=["list of web file addresses"] 
    print("   ...Files being downladed\n     ", "\n      ".join(files), "\n")


    ## Download files
    download_files(files, c_DateNow)

I want to download 25 files. When I run the code, every print statement that already executed earlier in the program gets printed again, even though execution is nowhere near those lines. I am also constantly getting the following error:

    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:
I googled the error but don't know how to fix it. Does it have to do with there not being enough cores? Is there a way to cap the number of Processes depending on how many cores I have available, for example with something like the sketch below? Or is it something else entirely?
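
This is a minimal, untested sketch of the kind of cap I mean, using multiprocessing.Pool (I can't run it against the real site, and I am assuming the ret object pickles cleanly to the workers):

from multiprocessing import Pool, cpu_count

def download_files_capped(filelist, dspath, savepath, ret):
    # never start more workers than there are cores, or than files
    n_workers = min(cpu_count(), len(filelist))
    args = [(dspath, file, savepath, ret) for file in filelist]
    with Pool(processes=n_workers) as pool:
        pool.starmap(down_file, args)  # blocks until every download finishes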

In a question here, I read that the Process has to be started from within the `__main__` block, but this code is a module that gets imported by another script, so when I run it, I run it like this:

import this_code 
import another1_code 
import another2_code 

#Step1
another1_code.main()

#Step2
c_DateNow, c_DateIni, c_DateFin = another2_code.main()

#Step3
this_code.main(c_DateNow, c_DateIni, c_DateFin)

#step4
## More code

So I need the Process to be started inside a function and not in `__main__`.

I appreciate any help or suggestions on how to correctly parallelize the above code in a way that lets me keep using it as a module imported from another script.

  • You really don't want to use multiprocessing here. You want to log into the web site, and then make asynchronous requests. See https://stackoverflow.com/questions/9110593/asynchronous-requests-with-python-requests – Frank Yellin Sep 28 '21 at 17:16
  • Thanks! I had no idea of that term. I read the stuff in that link and don't really get it yet, but I will keep reading. Do asynchronous requests use up all available cores? – M.O. Sep 28 '21 at 17:21
  • Async won't use all the cores. But if the file size is slowing things down, then your bandwidth may be the limitation, not your CPU. It really depends on what you mean by "want them to run parallel because they are heavy.": What do you mean by "run (the file)" and what do you mean by "heavy"? – 9769953 Sep 28 '21 at 17:30
  • I have access to a server where I intend to run the code on 20 cores, so ideally one core will download one file and all 20 will do it simultaneously. Each file is around 700MB and the server connection is slow, so each takes about a minute and a half to download (1.5 min * 25 is too long!). We need the code to be as fast as possible, which is why I thought multiprocessing was best, since I actually want each core to download one file. RAM is not an issue either. – M.O. Sep 28 '21 at 17:35
  • "This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module": this is clearly explained in the multiprocessing docs. You aren't showing how you call this code, but it implies you aren't using the `if __name__ == "__main__"` guard. As for "So I need the process to be within a function and not in __main__": **you must put it inside a `__main__` guard**. Just call the function in the `__main__` guard, i.e. refactor the main script to do the work in the `__main__` guard. – juanpa.arrivillaga Sep 28 '21 at 17:36
  • @M.O. your main bottleneck will be I/O. Multiprocessing does work here. – juanpa.arrivillaga Sep 28 '21 at 17:37
  • @juanpa.arrivillaga Yeah I am not calling it inside `__main__`. I explained at the end of the post how I am calling it as a module inside another code and that I need to keep that functionality. – M.O. Sep 28 '21 at 17:39
  • @M.O. you *can* keep that functionality. Put the code in that module **inside a `__main__` block**. This is pretty standard anyway, without multiprocessing – juanpa.arrivillaga Sep 28 '21 at 17:42
  • @juanpa.arrivillaga Maybe I am confused, I thought that everything inside `__main__` gets executed as soon as you import the code as a module. In which case it wouldn't work for me, because it uses inputs from another code and I need the other codes to execute before downloading all the files. Am I confused? – M.O. Sep 28 '21 at 17:45
  • @M.O. it sounds like you are confused. The whole point is that everything inside `__main__` *doesn't* get executed when you import the module (this is why it is necessary for multiprocessing when you don't use the fork start method and instead use spawn). In any case, the reasons you are giving don't make much sense. You would *literally just indent all your code after the imports and put it inside the main block*, and *nothing would change about the way it currently works* (except that multiprocessing would actually work); see the sketch after this thread. – juanpa.arrivillaga Sep 28 '21 at 17:49
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/237613/discussion-between-m-o-and-juanpa-arrivillaga). – M.O. Sep 28 '21 at 17:56
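
Based on juanpa.arrivillaga's comments, this is a minimal sketch of the refactored top-level script (same module names as in the question; the only change is that the calls now sit inside the `__main__` guard):

import this_code
import another1_code
import another2_code

if __name__ == "__main__":
    # this block is skipped when a spawned child process re-imports the module,
    # which is exactly what the bootstrapping error is complaining about

    #Step1
    another1_code.main()

    #Step2
    c_DateNow, c_DateIni, c_DateFin = another2_code.main()

    #Step3
    this_code.main(c_DateNow, c_DateIni, c_DateFin)

    #step4
    ## More code

On POSIX systems, calling multiprocessing.set_start_method('fork') early in the main script would also avoid the re-import, but the guard is the portable fix.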

0 Answers