
Honestly, I was not even sure what to title this question. I am trying to loop through a large list of URLs, but only process 20 URLs at a time (20 because that is how many proxies I have). At the same time I need to keep cycling through the proxy list as I process the URLs. So, for example, the 1st URL would use the 1st proxy, and once I hit the 21st URL, it would use the 1st proxy again. Here is my rough attempt below; if anyone can point me in the right direction, it would be much appreciated.

import pymysql.cursors
from multiprocessing import Pool
from fake_useragent import UserAgent

def worker(args):
    var_a, id, name, content, proxy, headers, connection = args
    print (var_a)
    print (id)
    print (name)
    print (content)
    print (proxy)
    print (headers)
    print (connection)
    print ('---------------------------')

if __name__ == '__main__':
    connection = pymysql.connect(
        host = 'host',
        user = 'user',
        password = 'password',
        db = 'db',
        charset='utf8mb4',
        cursorclass=pymysql.cursors.DictCursor
    )

    ua = UserAgent()
    user_agent = ua.chrome
    headers = {'User-Agent' : user_agent}

    proxies = [
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx'
    ]

    with connection.cursor() as cursor:
        sql = "SELECT id,name,content FROM table"
        cursor.execute(sql)
        urls = cursor.fetchall()

    var_a = 'static'

    data = ((var_a, url['id'], url['name'], url['content'], proxies[i % len(proxies)], headers, connection) for i, url in enumerate(urls))
    proc_num = 20
    p = Pool(processes=proc_num)
    results = p.imap(worker, data)
    p.close() 
    p.join()
antfuentes87
2 Answers


You can use a list to store the new processes. When you reach a certain number of items, call join on each process in the list. This gives you some control over the number of active processes.

from multiprocessing import Process

if __name__ == '__main__':
    proc_num = 20
    proc_list = []
    for i, url in enumerate(urls):
        proxy = proxies[i % len(proxies)]  # cycle through the proxies
        p = Process(target=worker, args=(url, proxy))
        p.start()
        proc_list.append(p)
        # Wait for the current batch to finish before starting the next one
        if (i + 1) % proc_num == 0 or i == len(urls) - 1:
            for proc in proc_list:
                proc.join()
            proc_list = []


If you want a constant number of active processes, you can try multiprocessing.Pool. Just modify the worker definition to receive a tuple.

from multiprocessing import Pool

if __name__ == '__main__':
    # Pair each url with a proxy, cycling through the proxy list
    data = ((url, proxies[i % len(proxies)]) for i, url in enumerate(urls))
    proc_num = 20
    p = Pool(processes=proc_num)
    results = p.imap(worker, data)
    p.close()
    p.join()

Just to clarify things, the worker function should receive a single tuple and then unpack it.

def worker(args):
    var_a, id, name, content, proxy, headers, connection = args
    print (var_a)
    ... etc ...
t.m.adam
  • I have been testing the code you gave me, and it works to a degree. But I have a while loop for when I am making a request, and it does not break until the request goes through (sometimes the back-connect proxy is bad and needs to wait to get a new one). When that happens, it seems to wait for the while loop to complete before any of the other links are requested. I thought the whole point of multiprocessing was being able to call the same function multiple times at once? Maybe I am misunderstanding how it works. – antfuentes87 Aug 14 '17 at 19:24
  • You could use `multiprocessing.Pool`, it should be much smoother. Also consider using a reasonable timeout (5 - 30 sec) in `requests.get`. – t.m.adam Aug 16 '17 at 06:20
  • That looks a lot smoother. I see you are inputting "data" into the imap, but what if I have more variables I need to pass into the function? I need to access url["name"], url["id"], etc. from urls, so I am a little confused as to how to add those variables into the imap. – antfuentes87 Aug 16 '17 at 14:47
  • Can you be more specific? `url` is a string, it doesn't have any keys. However you can modify the definition of `worker` to accept an arbitrary number of arguments: `def worker(*args):`, or build a "helper" function to unpack the arguments to `worker`, eg: `def helper(args): return worker(*args)` – t.m.adam Aug 16 '17 at 16:01
  • Yes, sorry, urls was just an example. urls is really a MySQL select query, so I need to be able to select the columns from that and pass them into the function, along with the proxies (which are set up exactly as I have them in my example above). Hope that is a little more clear. – antfuentes87 Aug 16 '17 at 16:28
  • Ok, got it. You can either change the tuples in `data` (e.g. `(url["name"], url["id"], proxies[i % len(proxies)])`), or change the `worker` function to handle the first argument as an SQL query. If you still can't make it work I'll be happy to help, as long as you include the updated code. – t.m.adam Aug 16 '17 at 17:07
  • My data looks like this: data = ((var_a, url['id'], url['name'], url['desc'], proxies[i % len(proxies)], headers, connection) for i, deal in enumerate(urls)). I have variables that are not in the SQL query that I am trying to pass as well (headers for the request, the SQL connection to do inserts in the function, etc...). This is what my worker function looks like: def worker(var_a, id, name, desc, proxy, headers, connection): When I test it, nothing happens; the function never gets called. I will update my original question with a better example to show what I have currently. – antfuentes87 Aug 18 '17 at 14:59
  • Are you not getting any errors? The `worker` should receive one argument, a tuple (see my post). Alternatively you can use a helper function to unpack the tuple, but that's an unnecessary complication. Please follow my instructions and let me know what happens. – t.m.adam Aug 18 '17 at 17:36
  • I updated my question with the exact code I just tried. When I run it, nothing prints out and there are no errors. If I print urls, it shows the MySQL query results. – antfuentes87 Aug 18 '17 at 17:53
  • Yes, I ran your code and it seems that you can't use a `pymysql.connect` object as an argument in `worker`. I realize that you need it to write data to the db, and of course you can't initiate a new connection for each process; that would be very expensive. The problem is that processes don't share memory, they only get a copy of the object. Perhaps you should consider multithreading. – t.m.adam Aug 18 '17 at 18:37
  • Yeah, I can give it a shot. Do you have a example of multithreading? – antfuentes87 Aug 18 '17 at 18:45
  • You can use the first code snippet in my post, just replace `Process` with `threading.Thread` – t.m.adam Aug 18 '17 at 18:54
  • Yeah, threading was a lot slower, but good news, I figured it out! I just put the connection = pymysql.connect() right after the imports but before the worker function, and then removed the connection variable from the data and the worker function. I was still able to insert into the database without having to pass the connection variable through the worker function, and it is a lot faster with the Pool as well :) Thank you for all your help! (A sketch of this final layout follows after these comments.) – antfuentes87 Aug 18 '17 at 19:28
  • Should you make note about the connection thing? Or does that not really matter? – antfuentes87 Aug 18 '17 at 19:33
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/152288/discussion-between-antfuentes87-and-t-m-adam). – antfuentes87 Aug 18 '17 at 19:34
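To illustrate the resolution described in the comments above, here is a minimal sketch of that final layout: the pymysql connection is created at module level, right after the imports, and is not passed to the workers through the Pool. The requests.get call (with a timeout, as suggested above), the use of the content column as the URL, and the INSERT statement are assumptions for illustration; adapt them to your actual worker logic.

import pymysql.cursors
import requests
from multiprocessing import Pool
from fake_useragent import UserAgent

# Connection created once at module level, as described in the comments,
# instead of being passed to the workers through the Pool.
connection = pymysql.connect(
    host='host',
    user='user',
    password='password',
    db='db',
    charset='utf8mb4',
    cursorclass=pymysql.cursors.DictCursor
)

def worker(args):
    # The worker receives a single tuple and unpacks it.
    var_a, id, name, content, proxy, headers = args
    try:
        # Hypothetical request; assumes `content` holds the URL to fetch.
        # A timeout keeps one bad proxy from blocking a worker forever.
        r = requests.get(content, headers=headers,
                         proxies={'http': proxy, 'https': proxy}, timeout=15)
    except requests.exceptions.RequestException:
        return None
    with connection.cursor() as cursor:
        # Hypothetical insert; replace with your actual query.
        cursor.execute("INSERT INTO results (id, name) VALUES (%s, %s)", (id, name))
    connection.commit()
    return id

if __name__ == '__main__':
    ua = UserAgent()
    headers = {'User-Agent': ua.chrome}
    proxies = ['xxx.xxx.xxx.xxx:xxxxx']  # your 20 proxies

    with connection.cursor() as cursor:
        cursor.execute("SELECT id,name,content FROM table")
        urls = cursor.fetchall()

    var_a = 'static'
    data = ((var_a, url['id'], url['name'], url['content'],
             proxies[i % len(proxies)], headers)
            for i, url in enumerate(urls))

    p = Pool(processes=20)
    for result in p.imap(worker, data):
        pass  # consume the results; the actual work happens in the workers
    p.close()
    p.join()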

Try the code below:

for i in range(len(urls)):
    url = urls[i] # Current URL
    proxy = proxies[i % len(proxies)] # Current proxy
    # ...
Pravitha V
Oliver Ni
  • What about only spawning 20 processes (or however many proxies there are in the list) at a time? – antfuentes87 Aug 13 '17 at 07:08
  • When each process starts, add it to a counter. Remove it when it ends. In the for loop, check the counter before starting a new process (see the sketch after these comments). – Oliver Ni Aug 13 '17 at 07:43
  • I guess I am just confused. Won't the for loop just make all the processes start at once? So if I have 1000 links, won't it try to start 1000 processes? How do I only have it create 20 processes at a time? – antfuentes87 Aug 13 '17 at 15:39
  • I think I need something like this https://stackoverflow.com/questions/20190668/python-multiprocessing-a-for-loop (first answer), but how do I input the proxies into the function, because in that answer there is no loop used; he just passes the array to map. – antfuentes87 Aug 13 '17 at 15:52
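One way to implement the counter idea from the comment above is with a multiprocessing.Semaphore: the parent acquires a slot before starting each process, and the worker releases it when finished, so at most len(proxies) processes run at the same time. This is only a sketch; the urls and proxies lists and the worker body are placeholders.

from multiprocessing import Process, Semaphore

def worker(url, proxy, sem):
    try:
        # ... fetch `url` through `proxy` here (placeholder) ...
        pass
    finally:
        sem.release()  # free a slot when this process finishes

if __name__ == '__main__':
    urls = ['http://example.com/1', 'http://example.com/2']  # placeholder
    proxies = ['xxx.xxx.xxx.xxx:xxxxx']                      # placeholder

    limit = len(proxies)   # e.g. 20 concurrent processes
    sem = Semaphore(limit)

    procs = []
    for i, url in enumerate(urls):
        sem.acquire()      # blocks while `limit` workers are already running
        p = Process(target=worker, args=(url, proxies[i % len(proxies)], sem))
        p.start()
        procs.append(p)

    for p in procs:
        p.join()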