0

The program checks if a url results in a 404, and if it does it writes a username to a file. I tried to add multiprocessing so that the program would run faster, as there were times where I would have input text files with 1000's of lines and it would take quite a while. However, the first time I run this program (when the output text file is empty), it doesn't write anything to the output text file. It only begins to write to the output file on the 2nd, 3rd, 4th ... run.

#program checks twitch accounts in a file.
#writes accounts which aren't taken to another file.
import requests
from multiprocessing import Pool
x = "0"

accounts = open('accounts.txt', 'r')
valid_accounts = open('valid accounts.txt', 'a')

base_url = "https://www.twitch.tv/"

def check(x):
    for line in accounts:
        url = base_url + line
        twitch_r = requests.get(url)
        if twitch_r.status_code == 404:
            valid_accounts.write(line + "\n")



def Main():
    p = Pool(processes=25)
    p.imap(check, x)
    accounts.close()
    valid_accounts.close()



if __name__ == "__main__":
    Main()

3 Answers3

0

You should call p.close() and then p.join() at the end of Main().

John Zwinck
  • 239,568
  • 38
  • 324
  • 436
  • That fixed it. I understand that the close function is used when I no longer am going to use the pool, but I'm not sure what the join function does. From what I can gather, the join function waits for the worker processes to terminate, but shouldn't my code be be finished right after they terminate? What's there to wait for? – abc12333333 Feb 03 '17 at 23:18
  • @abc12333333 `join()` waits until all processes are done. Your program above starts the processes and immediately exits and makes the processes die as well – hansaplast Feb 04 '17 at 08:05
0

You are not passing your accounts to pool map

p.imap(check, accounts)
iklinac
  • 14,944
  • 4
  • 28
  • 30
  • I think I coded it poorly, but in the code posted above the argument being passed to my pool.imap was just random because I needed an argument. The check function itself was getting the accounts from the file. – abc12333333 Feb 03 '17 at 23:29
0

Your main problem is that you use imap instead of map. imap is nonblocking, that means that your main process quits before the processes run through. I'm a bit suprised that it worked sometimes as I think it should have worked never.

That said, there are a few problems with your program:

  • the check method, running in different processes, shares one file handler and iterates over the lines. This is just working by chance (it will not work on Windows, for instance) and is bad practice (to put it midly). You should read the file first and then distribute the lines to the processes
  • same thing applies with writing to the file. Although appending to a file is safe to do also across processes, a better design would be to put that at the end into the parent process
  • map and imap are thought to run on a list of arguments and then return the result (mapped value)
  • leave out the processes=20 so python can find out the best number of processes based on how many cores your computer has

Based on these things, that's the code I propose:

# program checks twitch accounts in a file.
# writes accounts which aren't taken to another file.
import requests
from multiprocessing import Pool, Queue

base_url = "https://www.twitch.tv/"

def check(line):
    twitch_r = requests.get(base_url + line)
    if twitch_r.status_code == 404:
        return line

def Main():
    queue_in = Queue()
    queue_out = Queue()
    p = Pool()

    with open('accounts.txt', 'r') as accounts:
        lines = accounts.readlines()

    results = p.map(check, lines)
    results = [r for r in results if r != None]
    with open('valid accounts.txt', 'a') as valid_accounts:
        for result in results:
            valid_accounts.write(result)

if __name__ == "__main__":
    Main()

The only thing to be noted is that you need to strip out the None in results because check(line) returns None for all the urls which result is not a 404.

Updates:

After using John's solution, the program is working as intended

I doubt it does. Since you are on windows, every process has it's own filehandler pointing to accounts.txt and will cycle through all the lines. So you end up checking every url 20 times and the multiprocessing didn't help you

I used imap because I read that imap doesn't return a list (?)

No. The difference of map vs. imap in this situation is only that map waits until all processes are done (thus, you don't need to call join).

For a more thorough discussion about map vs imap see here

Community
  • 1
  • 1
hansaplast
  • 11,007
  • 2
  • 61
  • 75
  • After using John's solution, the program is working as intended. It was working on the runs after the first one on Windows before I had changed anything. I do agree with you that I should have read the lines first and then sent that to the arguments of the pool, but I'm not sure how to make it so that each process takes in a different argument. If I make a for loop for each line in the file, I would be using line as an argument for my pool function, so if I'm not wrong, every process would be using the same argument. I used imap because I read that imap doesn't return a list (?). 404 is right. – abc12333333 Feb 03 '17 at 23:24
  • @abc12333333 if you want to use map and make it so that "each process takes a different argument" then have a look at my solution. That's exactly what it does. map takes one value from `lines` and hands it to map and then stores the `return` in `results`. And this is done in parallel – hansaplast Feb 04 '17 at 08:26
  • @abc12333333 your other questions I added to my answer. About the 404: I have now changed that so it does what you want. Please try my solution above, it should just work, it works on my machine – hansaplast Feb 04 '17 at 08:35
  • I tried your solution and it is much faster than the program I initially created, so that must mean that what I had wasn't using multiprocessing correctly. I don't think I fully understand how multiprocessing works. For the following line: `results = p.map(check, lines)` Lines is a list, how do the processes each pick a different element of the line rather than all of them going from element 0 to the end of the list? – abc12333333 Feb 05 '17 at 01:24
  • @abc12333333 sure it runs much faster, as in your program every process checked the url again, so you ended up with 20 processes who all did the same. Check my answer after "After using John's solution, the program is working as intended" – hansaplast Feb 05 '17 at 06:11
  • in this line: `results = p.map(check, lines)` map splits up the `lines` list and sends each process only one element, and goes sure that every element is only sent once to a process. that's why your function `check` only takes an element and not a list – hansaplast Feb 05 '17 at 06:13