
I have 5000 URLs to request, and I need to check for a specific word inside the source of each URL.

I want to do it as fast as possible; I am new to Python.

This is my code:

import requests
def checkurl(url):
    r = requests.get(url)
    if 'House' in r.text:
        return True
    else:
        return False

If I do a for loop it will take a lot of time, so I need a solution using multithreading or multiprocessing.

Thanks for the help in advance :)

  • It should honestly be avoided in loops, but I've added an answer for that case anyhow. I would recommend having a look at scrapy and / or reading this question: https://stackoverflow.com/questions/9110593/asynchronous-requests-with-python-requests – flindeberg Dec 26 '18 at 00:55

1 Answer


Check out scrapy (https://scrapy.org/); it has tools for your purpose.

In my experience scrapy is better than just downloading "strings", since requests.get does not (for example) actually render the page.
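If you decide to try scrapy, a minimal spider along these lines might do. This is only a sketch; the spider name, file name, and URLs are placeholders I made up, not from the original answer:

import scrapy

class WordCheckSpider(scrapy.Spider):
    # hypothetical spider name and URL list; replace start_urls with your own 5000 URLs
    name = "wordcheck"
    start_urls = ["https://example.com/page1", "https://example.com/page2"]

    def parse(self, response):
        # record whether the word appears anywhere in the raw page source
        yield {"url": response.url, "has_word": "House" in response.text}

You could run it with something like scrapy runspider wordcheck_spider.py -o results.json.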

If you want to do it with requests anyhow (written in freehand, so might contain spelling / other errors):

import requests
from multiprocessing.pool import ThreadPool

def startUrlCheck(nr, urls):
    # nr = number of worker threads to use
    pool = ThreadPool(nr)
    results = pool.map(checkurl, urls)
    pool.close()
    pool.join()
    # Do something smart with results
    return results

def checkurl(url):
    r = requests.get(url)
    if 'House' in r.text:
        return True
    else:
        return False
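
A quick usage sketch (the URL list and thread count here are placeholders, not from the original answer):

if __name__ == '__main__':
    # replace with your real list of 5000 URLs
    your_urls = ['https://example.com', 'https://example.org']
    results = startUrlCheck(20, your_urls)
    print(results)  # e.g. [True, False]
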
flindeberg
  • My function works perfectly with the for loop, but what I want is to get results as fast as possible. scrapy is a little bit confusing for me as I am new to Python; if you could help me with multithreading my function that'd be great. Anyway, appreciate your answer bro :) – elrich bachman Dec 26 '18 at 00:32
  • @elrichbachman Check out my updated answer. It's quick and dirty but works. In general Python would suggest you use processes, which take up extra overhead; using threads is faster but "riskier", though that shouldn't matter for this particular case. – flindeberg Dec 26 '18 at 00:53
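
For completeness, the same threaded idea can also be written with the standard library's concurrent.futures module. This is a sketch under the same assumptions (a placeholder URL list and 20 worker threads), not part of the original answer:

import requests
from concurrent.futures import ThreadPoolExecutor

def checkurl(url):
    r = requests.get(url)
    return 'House' in r.text

# placeholder list; replace with your 5000 URLs
urls = ['https://example.com', 'https://example.org']

with ThreadPoolExecutor(max_workers=20) as executor:
    # map checkurl over all URLs using a pool of 20 threads
    results = list(executor.map(checkurl, urls))

print(results)  # e.g. [True, False]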