
I'm trying to improve the speed of my program, and I decided to use multiprocessing!

The problem is I can't seem to find any way to use `Pool` (I think this is what I need) with my function.

Here is the code I am dealing with:

import json
import os

def dataLoading(output):
    for i in os.listdir():
        with open(i) as currentFile:
            data = json.load(currentFile)
        try:
            name = data["name"]
            link = data["link"]
            upCheck = data["upCheck"]
            isSuccess = data["isSuccess"]
        except KeyError:
            print("error in loading data from config: improper naming or formatting used")
            continue
        output[name] = [link, upCheck, isSuccess]

#working
import requests

def userCheck(link, user, isSuccess):
    link = link.replace("<USERNAME>", user)
    isSuccess = isSuccess.replace("<USERNAME>", user)
    html = requests.get(link, headers=headers)  # headers is defined elsewhere in my script
    page_source = html.text
    count = page_source.count(isSuccess)
    return count > 0

I have a parent function to run these two together, but I don't think I need to show the whole thing, just the part that gets the data iteratively:

    for i in configData:
        data = configData[i]
        link = data[0]
        print(link)
        upCheck = data[1] #just for future use
        isSuccess = data[2]
        if userCheck(link, username, isSuccess):
            good.append(i)

You can see how I pass all of the data in there. How would I be able to use multiprocessing to do this when I am iterating through the dictionary to collect multiple parameters?

AMC
Ironkey
    _the problem is I can't seem to find any way to use the pool function (i think this is what i need) to use my function_ Can you be more specific about what the issue is? As an aside, be careful when using a bare except, see https://stackoverflow.com/questions/54948548/what-is-wrong-with-using-a-bare-except. Also, variable and function names should generally follow the `lower_case_with_underscores` style. – AMC Apr 28 '20 at 00:25

1 Answer


I like to use `mp.Pool().map`. I think it is the easiest and most straightforward approach, and it handles most multiprocessing cases. So how does `map` work? To start, we have to keep in mind that mp creates workers, and each worker receives a copy of the namespace (yes, the whole thing). Each worker then works on what it is assigned and returns. Hence, something like "updating a global variable" while they work doesn't work, since each worker receives its own copy of the global variable and none of the workers communicate with each other. (If you want communicating workers, you need to use `mp.Queue`s and such; it gets complicated.) Anyway, here is `map` in action:

from multiprocessing import Pool

t = 'abcd'

def func(s):
    return t[int(s)]

results = Pool().map(func, range(4))

Each worker received a copy of t, func, and the portion of range(4) they were assigned. They are then automatically tracked and everything is cleaned up in the end by Pool.

Something like your dataLoading won't work very well here; we need to modify it. I also cleaned up the code a little.

def loadfromfile(file):
    with open(file) as f:
        data = json.load(f)
    items = [data.get(k, "") for k in ['name', 'link', 'upCheck', 'isSuccess']]
    return items[0], items[1:]

output = dict(Pool().map(loadfromfile, os.listdir()))
Bobby Ocean
  • wow, thanks! so if I wanted to run usercheck in a pool would i do something like this? `p.map(userCheck, output)` – Ironkey Apr 28 '20 at 00:21
  • 1
    Right, Pool().map(userCheck, output.items()). userCheck needs to be updated to take in a name,items, where items is the link, user, isSuccess. – Bobby Ocean Apr 28 '20 at 00:41
  • when i try my code, it opens a bunch of python instances in task manager that takes up 97% cpu usage. i think it just isn't stopping, https://pastebin.com/RAkpthQF – Ironkey Apr 28 '20 at 15:03
  • Not sure what you are asking. Multiprocessing is supposed to take up 100% of your CPU; that is the point. I don't know how much data or how many files you have in your directory. I don't see a bug, but I could have made a mistake. – Bobby Ocean Apr 29 '20 at 00:47
  • I see that it takes up CPU, but the process never finishes, so I'm not sure if it has completed the task or when it's going to stop. I'm continuing to play with it though, I'll find something :D – Ironkey Apr 29 '20 at 13:49
  • Shouldn't be hanging around that long, Python's garbage collector is really good. If it is still giving you problems, you can explicitly .close() the Pool(). Like pool = mp.Pool(), do stuff with pool, like pool.map(), and then pool.close() when finished. You can use the context manager system if you would like as well. – Bobby Ocean Apr 29 '20 at 16:24