I am trying to write a crawler for a web security project, and I'm seeing strange behaviour in a method that uses multiprocessing.
What should this method do? It iterates over the target web pages found by the crawler, together with the query parameters discovered for each page. For each web page, it should apply the method phase1 (my attack logic) to every query parameter associated with that page.
Meaning, if I have http://example.com/sub.php with the query parameters page and secret, and http://example.com/s2.php with the parameter topsecret, it should do the following (sketched in code right after this list):
- attack page from http://example.com/sub.php
- attack secret from http://example.com/sub.php
- attack topsecret from http://example.com/s2.php
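In other words, given a siteparams dict mapping each page to its parameter list, I expect the equivalent of this (simplified, hypothetical) nested loop, with phase1 actually firing for every page/parameter pair:

siteparams = {
    "http://example.com/sub.php": ["page", "secret"],
    "http://example.com/s2.php": ["topsecret"],
}
for victim, paramlist in siteparams.items():
    for param in paramlist:
        # phase1 should be dispatched (across the pool) for this pair
        print("attacking", param, "on", victim)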
I can tell whether an attack is happening based on timing and the output of phase1.
What actually happens
Only the first attack is executed; the subsequent calls to apply_async appear to be ignored. However, the loop itself keeps running, since the print statements above still produce output for every iteration.
What is going wrong here? Why is the attack routine not triggered? I have looked at the multiprocessing docs, but they don't explain this behaviour.
Some answers to related problems suggested using terminate and join, but isn't this done implicitly here, since I'm using the with statement?
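As far as I can tell from the docs, the with statement expands to roughly the following, which is why I assumed no explicit cleanup is needed:

pool = Pool(processes=processes)
try:
    ...  # submit tasks via apply_async and fetch the results
finally:
    pool.terminate()  # Pool.__exit__() calls terminate(), not close() + join()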
Also, this question (Multiprocessing pool 'apply_async' only seems to call function once) sounds very similar, but is different from my problem. Unlike that question, my problem is not that only 1 worker executes the code, but that my X workers are only spawned once (instead of Y times).
What I've tried: moving the with Pool(...) block outside of the loops, but nothing changed.
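Roughly, the variant I tried looked like this (simplified; the args tuple is the same as in the full method below):

with Pool(processes=processes) as pool:
    for victim, paramlist in siteparams.items():
        for param in paramlist:
            res = [pool.apply_async(phase1, args=(1,victim,victim2,param,None,"",verbose,depth,l,file,authcookie,"",)) for l in paysplit]
            for i in res:
                i.get()
# same symptom: only the first batch of tasks ever runs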
The method in question is the following:
import os
import json
import time
from multiprocessing import Pool
# viclist, processes, cachedir, color, parseUrl and phase1 are module-level
# names defined elsewhere in the project

def analyzeParam(siteparams, paysplit, victim2, verbose, depth, file, authcookie):
    result = {}
    subdir = parseUrl(viclist[0])
    for victim, paramlist in siteparams.items():
        sub = {}
        print("\n{0}[INFO]{1} param{4}|{2} Attacking {3}".format(color.RD, color.END + color.O, color.END, victim, color.END+color.RD))
        time.sleep(1.5)
        for param in paramlist:
            payloads = []
            nullbytes = []
            print("\n{0}[INFO]{1} param{4}|{2} Using {3}\n".format(color.RD, color.END + color.O, color.END, param, color.END+color.RD))
            time.sleep(1.5)
            with Pool(processes=processes) as pool:
                res = [pool.apply_async(phase1, args=(1,victim,victim2,param,None,"",verbose,depth,l,file,authcookie,"",)) for l in paysplit]
                # fetch results from the async calls
                for i in res:
                    tuples = i.get()
                    payloads += tuples[0]
                    nullbytes += tuples[1]
            sub[param] = (payloads, nullbytes)
            time.sleep(3)
        result[victim] = sub
    if not os.path.exists(cachedir+subdir):
        os.makedirs(cachedir+subdir)
    with open(cachedir+subdir+"spider-phase2.json", "w+") as f:
        json.dump(result, f, sort_keys=True, indent=4)
    return result
Some technical information:
- Python version: 3.8.5
- I doubt that the bug lies in phase1, since it behaves as intended when Pool is called outside a loop, even multiple times. If you want to look it up, the source code is here: https://github.com/VainlyStrain/Vailyn
How do I fix this? Thanks!