
I have been trying to figure out how I can speed up my scraper and also pick up some knowledge about threading along the way.

I have a function that performs a GET request for a given link, scrapes some data, and saves it to a list that it returns. I call it for two links and then compare the results to see whether any new links have appeared at either of them:

"""
def getScrapeLinks(self, siteURL):
    response = requests.get(
                    siteURL,
                    timeout=5
                )

    if response.ok:
        bs4 = soup(response.text, 'lxml')

        links = ['{}'.format( raw_product.find('a').get('href')) for
                    raw_product in bs4.find_all('div', {'class': 'test'})]

        return links

"""

def pollNewProducts(self, storeClass):

    # storeClass.siteCatalog = ["https://www.google.com", "https://www.facebook.com"]

    LinksLists = reduce(operator.add, [self.getScrapeLinks(link) for link in storeClass.siteCatalog])

    while True:

        newLinksLists = reduce(operator.add,
                                 [self.getScrapeLinks(link) for link in storeClass.siteCatalog]
                                 )

        for URL in newLinksLists:
            if URL not in LinksLists:
                print("New link")
                print(URL)
                LinksLists.append(URL)

        print("Sleep to see new links!")
        time.sleep(random.randint(2, 4))

For now my problem is that with `reduce` the requests run sequentially: the first request (e.g. Google) has to finish and be scraped before the second one (Facebook) is even started. What I want is to give each link its own thread so the requests can run simultaneously instead of each one depending on the previous.

I wonder: how can I run each link by itself and still be able to compare the results to spot a new URL when one appears in a GET response?
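The kind of concurrency I am after can be sketched with the standard library's `concurrent.futures` (the scraper here is a stub standing in for `getScrapeLinks`, and the URLs and links are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_links(url):
    # Stub standing in for getScrapeLinks(): a real version would issue
    # the GET request and parse the product links out of the response.
    fake_catalog = {
        "https://site-a.example": ["/p/1", "/p/2"],
        "https://site-b.example": ["/p/3"],
    }
    return fake_catalog.get(url, [])


def poll_once(urls):
    # Each URL is fetched on its own thread; map() preserves input order.
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        results = list(pool.map(scrape_links, urls))
    return [link for links in results for link in links]


known = set(poll_once(["https://site-a.example", "https://site-b.example"]))
print(sorted(known))  # → ['/p/1', '/p/2', '/p/3']
```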

PythonNewbie

1 Answer


Adapted from my answer to this question.


You should look into asynchronous programming. Unlike threaded code, asynchronous code runs in a single thread, inside an event loop. The event loop automatically switches context between different operations whenever the Python keyword await is reached.

In other words, think of scraping websites as the following:

client sends request -> ... waiting for server reply ... <- server replies

Sending a request is an operation that takes a very small amount of time and consumes almost no resources. The real time consumer is waiting for the server to respond, and then processing the server's reply. If instead we do something that resembles the following:

client sends request -> switch operation -> ... wait ... <- server replies
client sends request -> switch operation -> ... wait ... <- server replies
client sends request -> switch operation -> ... wait ... <- server replies
...

Then we can minimize our time waiting for the server to reply, and instead already be shooting the next request over. In other words, what we can effectively do is tell Python to send the request, and then instantly switch to a different part of our code that sends another request, and then another part that sends another request, and so on. When all the requests are sent, we can come back and start interpreting the individual server replies.
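A minimal self-contained illustration of that switching, with asyncio.sleep standing in for the server's response time:

```python
import asyncio
import time


async def request(name, delay):
    # "await" hands control back to the event loop while this coroutine
    # waits, so the other requests can be fired off in the meantime.
    await asyncio.sleep(delay)
    return name


async def main():
    start = time.monotonic()
    # All three "requests" are in flight at once; gather() returns their
    # results in argument order once every one has finished.
    replies = await asyncio.gather(
        request("a", 0.3), request("b", 0.3), request("c", 0.3)
    )
    return replies, time.monotonic() - start


replies, elapsed = asyncio.run(main())
print(replies)  # → ['a', 'b', 'c'], after roughly 0.3 s rather than 0.9 s
```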

There are plenty of references online on how to program asynchronously in Python (using the built-in asyncio module + the PyPI-installable aiohttp module), and I would suggest Googling away. Here is a code sample that will take less than 4 seconds to scrape over 100 websites (note that this scales extremely well, and the 4 seconds is actually due to the print statements... without them, it's actually closer to 2 seconds):

import asyncio
import aiohttp
import time


websites = """https://www.youtube.com
https://www.facebook.com
https://www.baidu.com
https://www.yahoo.com
https://www.amazon.com
https://www.wikipedia.org
http://www.qq.com
https://www.google.co.in
https://www.twitter.com
https://www.live.com
http://www.taobao.com
https://www.bing.com
https://www.instagram.com
http://www.weibo.com
http://www.sina.com.cn
https://www.linkedin.com
http://www.yahoo.co.jp
http://www.msn.com
http://www.uol.com.br
https://www.google.de
http://www.yandex.ru
http://www.hao123.com
https://www.google.co.uk
https://www.reddit.com
https://www.ebay.com
https://www.google.fr
https://www.t.co
http://www.tmall.com
http://www.google.com.br
https://www.360.cn
http://www.sohu.com
https://www.amazon.co.jp
http://www.pinterest.com
https://www.netflix.com
http://www.google.it
https://www.google.ru
https://www.microsoft.com
http://www.google.es
https://www.wordpress.com
http://www.gmw.cn
https://www.tumblr.com
http://www.paypal.com
http://www.blogspot.com
http://www.imgur.com
https://www.stackoverflow.com
https://www.aliexpress.com
https://www.naver.com
http://www.ok.ru
https://www.apple.com
http://www.github.com
http://www.chinadaily.com.cn
http://www.imdb.com
https://www.google.co.kr
http://www.fc2.com
http://www.jd.com
http://www.blogger.com
http://www.163.com
http://www.google.ca
https://www.whatsapp.com
https://www.amazon.in
http://www.office.com
http://www.tianya.cn
http://www.google.co.id
http://www.youku.com
https://www.example.com
http://www.craigslist.org
https://www.amazon.de
http://www.nicovideo.jp
https://www.google.pl
http://www.soso.com
http://www.bilibili.com
http://www.dropbox.com
http://www.xinhuanet.com
http://www.outbrain.com
http://www.pixnet.net
http://www.alibaba.com
http://www.alipay.com
http://www.chrome.com
http://www.booking.com
http://www.googleusercontent.com
http://www.google.com.au
http://www.popads.net
http://www.cntv.cn
http://www.zhihu.com
https://www.amazon.co.uk
http://www.diply.com
http://www.coccoc.com
https://www.cnn.com
http://www.bbc.co.uk
https://www.twitch.tv
https://www.wikia.com
http://www.google.co.th
http://www.go.com
https://www.google.com.ph
http://www.doubleclick.net
http://www.onet.pl
http://www.googleadservices.com
http://www.accuweather.com
http://www.googleweblight.com
http://www.answers.yahoo.com"""


async def get(url):
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url=url) as response:
                # "await" yields control back to the event loop here, so
                # other requests can be sent while this one waits for its reply.
                resp = await response.read()
                print("Successfully got url {} with response of length {}.".format(url, len(resp)))
    except Exception as e:
        print("Unable to get url {} due to {}.".format(url, e.__class__))

async def main(urls):
    # Fire off every get() at once; gather() waits until they have all finished.
    ret = await asyncio.gather(*[get(url) for url in urls])
    print("Finalized all. ret is a list of len {} outputs.".format(len(ret)))


urls = websites.split("\n")
amount = len(urls)

start = time.time()
asyncio.run(main(urls))
end = time.time()

print("Took {} seconds to pull {} websites.".format(end - start, amount))

Outputs:

Successfully got url http://www.google.com.br with response of length 12188.
Successfully got url http://www.google.it with response of length 12155.
Successfully got url https://www.t.co with response of length 0.
Successfully got url http://www.msn.com with response of length 46335.
Successfully got url http://www.chinadaily.com.cn with response of length 122053.
Successfully got url https://www.google.co.in with response of length 11557.
Successfully got url https://www.google.de with response of length 12135.
Successfully got url https://www.facebook.com with response of length 115258.
Successfully got url http://www.gmw.cn with response of length 120866.
Successfully got url https://www.google.co.uk with response of length 11540.
Successfully got url https://www.google.fr with response of length 12189.
Successfully got url http://www.google.es with response of length 12163.
Successfully got url http://www.google.co.id with response of length 12169.
Successfully got url https://www.bing.com with response of length 117915.
Successfully got url https://www.instagram.com with response of length 36307.
Successfully got url https://www.google.ru with response of length 12128.
Successfully got url http://www.googleusercontent.com with response of length 1561.
Successfully got url http://www.xinhuanet.com with response of length 179254.
Successfully got url http://www.google.ca with response of length 11592.
Successfully got url http://www.accuweather.com with response of length 269.
Successfully got url http://www.googleadservices.com with response of length 1561.
Successfully got url https://www.whatsapp.com with response of length 77951.
Successfully got url http://www.cntv.cn with response of length 3139.
Successfully got url http://www.google.com.au with response of length 11579.
Successfully got url https://www.example.com with response of length 1270.
Successfully got url http://www.google.co.th with response of length 12151.
Successfully got url https://www.amazon.com with response of length 465905.
Successfully got url https://www.wikipedia.org with response of length 76240.
Successfully got url https://www.google.co.kr with response of length 12211.
Successfully got url https://www.apple.com with response of length 63322.
Successfully got url http://www.uol.com.br with response of length 333257.
Successfully got url https://www.aliexpress.com with response of length 59742.
Successfully got url http://www.sohu.com with response of length 215201.
Successfully got url https://www.google.pl with response of length 12144.
Successfully got url https://www.googleweblight.com with response of length 0.
Successfully got url https://www.cnn.com with response of length 1138392.
Successfully got url https://www.google.com.ph with response of length 11561.
Successfully got url https://www.linkedin.com with response of length 71498.
Successfully got url https://www.naver.com with response of length 176038.
Successfully got url https://www.live.com with response of length 3667.
Successfully got url https://www.twitch.tv with response of length 61599.
Successfully got url http://www.163.com with response of length 696338.
Successfully got url https://www.ebay.com with response of length 307068.
Successfully got url https://www.wordpress.com with response of length 76680.
Successfully got url https://www.wikia.com with response of length 291400.
Successfully got url http://www.chrome.com with response of length 161223.
Successfully got url https://www.twitter.com with response of length 291741.
Successfully got url https://www.stackoverflow.com with response of length 105987.
Successfully got url https://www.netflix.com with response of length 83125.
Successfully got url https://www.tumblr.com with response of length 78110.
Successfully got url http://www.doubleclick.net with response of length 129901.
Successfully got url https://www.yahoo.com with response of length 531829.
Successfully got url http://www.soso.com with response of length 174.
Successfully got url https://www.microsoft.com with response of length 187549.
Successfully got url http://www.office.com with response of length 89556.
Successfully got url http://www.alibaba.com with response of length 167978.
Successfully got url https://www.reddit.com with response of length 483295.
Successfully got url http://www.outbrain.com with response of length 24432.
Successfully got url http://www.tianya.cn with response of length 7941.
Successfully got url https://www.baidu.com with response of length 156768.
Successfully got url http://www.diply.com with response of length 3074314.
Successfully got url http://www.blogspot.com with response of length 94478.
Successfully got url http://www.popads.net with response of length 14548.
Successfully got url http://www.answers.yahoo.com with response of length 104726.
Successfully got url http://www.blogger.com with response of length 94478.
Successfully got url http://www.imgur.com with response of length 4008.
Successfully got url http://www.qq.com with response of length 244841.
Successfully got url http://www.paypal.com with response of length 45587.
Successfully got url http://www.pinterest.com with response of length 45692.
Successfully got url http://www.github.com with response of length 86917.
Successfully got url http://www.zhihu.com with response of length 31473.
Successfully got url http://www.go.com with response of length 594291.
Successfully got url http://www.fc2.com with response of length 34546.
Successfully got url https://www.amazon.de with response of length 439209.
Successfully got url https://www.youtube.com with response of length 439571.
Successfully got url http://www.bbc.co.uk with response of length 321966.
Successfully got url http://www.tmall.com with response of length 234388.
Successfully got url http://www.imdb.com with response of length 289339.
Successfully got url http://www.dropbox.com with response of length 103714.
Successfully got url http://www.bilibili.com with response of length 50959.
Successfully got url http://www.jd.com with response of length 18105.
Successfully got url http://www.yahoo.co.jp with response of length 18565.
Successfully got url https://www.amazon.co.jp with response of length 479721.
Successfully got url http://www.craigslist.org with response of length 59372.
Successfully got url https://www.360.cn with response of length 74502.
Successfully got url http://www.ok.ru with response of length 170516.
Successfully got url https://www.amazon.in with response of length 460696.
Successfully got url http://www.booking.com with response of length 408992.
Successfully got url http://www.yandex.ru with response of length 116661.
Successfully got url http://www.nicovideo.jp with response of length 107271.
Successfully got url http://www.onet.pl with response of length 720657.
Successfully got url http://www.alipay.com with response of length 21698.
Successfully got url https://www.amazon.co.uk with response of length 443607.
Successfully got url http://www.sina.com.cn with response of length 579107.
Successfully got url http://www.hao123.com with response of length 295213.
Successfully got url http://www.pixnet.net with response of length 6295.
Successfully got url http://www.coccoc.com with response of length 45822.
Successfully got url http://www.taobao.com with response of length 393128.
Successfully got url http://www.weibo.com with response of length 95482.
Successfully got url http://www.youku.com with response of length 762485.
Finalized all. ret is a list of len 100 outputs.
Took 3.899034023284912 seconds to pull 100 websites.

As you can see, 100 websites from across the world were successfully reached (with or without https) in about 4 seconds with aiohttp on my internet connection (Miami, Florida). Keep in mind that the following can slow the program down by a few ms:

  • print statements (yes, including the ones placed in the code above).
  • Reaching servers further away from your geographical location.

The example above suffers from both of these, so it is arguably a least-optimized way of doing what you have asked. However, I do believe it is a great start for what you are looking for.
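Adapted to the polling use case in the question, the same pattern might look like the sketch below. The fetches here are simulated with asyncio.sleep so the snippet stands alone; in a real version each `fetch_links` body would be an aiohttp GET plus the BeautifulSoup parsing from `getScrapeLinks`:

```python
import asyncio


async def fetch_links(url, delay, links):
    # Simulated scrape: each "site" replies after its own delay with a
    # list of product links. A real version would GET and parse the page.
    await asyncio.sleep(delay)
    return links


async def poll_once(sites):
    # All sites are queried concurrently; total time is roughly the
    # slowest site's delay, not the sum of all delays.
    results = await asyncio.gather(*[fetch_links(u, d, l) for u, d, l in sites])
    return {link for links in results for link in links}


sites = [
    ("https://site-a.example", 0.2, ["/p/1", "/p/2"]),
    ("https://site-b.example", 0.1, ["/p/3"]),
]

seen = asyncio.run(poll_once(sites))        # initial snapshot of known links
sites[1] = ("https://site-b.example", 0.1, ["/p/3", "/p/4"])  # a new link appears
new = asyncio.run(poll_once(sites)) - seen  # set difference = newly appeared links
print(sorted(new))  # → ['/p/4']
```

Using a set difference instead of `URL not in LinksLists` also makes the comparison O(1) per link rather than a linear scan.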

felipe
  • Got it! It seems a bit more complicated than what I did with threading. However, I wasn't sure from reading your answer whether it runs simultaneously, or whether it needs to wait for the first page's async request to finish before it can go on to the second one? Or does it fire off all the links in the list asynchronously and then wait until they are all finished? – PythonNewbie May 20 '20 at 23:02
  • 1
    `await asyncio.gather(*[get(url) for url in urls])` we are calling all of our `get()` functions with all of the urls. `resp = await response.read()` we switch to a different context. Essentially, when we get to the `await` keyword we go back to the event loop and say "ok, what else is there for me to do so that I'm not idle?" What there is to do is continue processing more `get()` functions. Then once you reach the last `get()` function, you go to first `get()` function and say "awesome, did you finish getting?" in most cases it has, if not, it waits until at least one response comes through. – felipe May 20 '20 at 23:14
  • 2
    To answer more directly, asynchronous code does _**not** run simultaneously_. The entire purpose of asynchronous code is to allow you to start operations that initiate an external waiting period one after the other without having to wait for each external waiting period to finish. This could be reading millions of files from disk (there is a delay reading files from disk), sending millions of requests into the internet (delay on server responses), etc. Lmk if that makes sense. – felipe May 20 '20 at 23:21
  • Alright, that seems to be exactly what I am looking for. I will, however, need to work out a bit how to transfer/convert my code to the async sample you gave. Is there any tip you can already suggest, something I should think about? – PythonNewbie May 20 '20 at 23:25
  • Absolutely, I think I do get the hang of the purpose, as you mentioned: *allow you to start operations that initiate an external waiting period one after the other without having to wait for each external waiting period to finish* - that was kind of my idea, where I want the URLs to run without waiting for each other and then collect into a single list. If I am not incorrect again.. :) – PythonNewbie May 20 '20 at 23:30
  • 1
    Seems like you got the idea. :) I would suggest getting comfortable with `asyncio` -- it's a very powerful module, if I may say so myself. – felipe May 21 '20 at 00:44
  • Hey again! After testing out async I realized there might be a problem for me. For example, 99 sites might load in 3 sec while the last URL takes over 15 sec. The problem is that all of the sites end up waiting for the last one to finish before the process continues, which is not how I wanted it to run :( – PythonNewbie May 21 '20 at 08:14
  • This is where asynchronous programming will start to bend the way you view programming all together. In the example above all we are doing is scraping a bunch of websites, so in the end, if there is _one_ hanging the whole program will hang before stopping. In your case, where you want to process the website afterwards, you technically wouldn't care if 1 of 100 requests is taking 15 seconds longer because during that interim you would be processing the other websites. The key to asynchronous programming is creating a "framework" that allow it to continuously "jump" from ops to ops effectively. – felipe May 21 '20 at 14:24
  • 1
    Oh yeah, I think I will need to read about async more and apply it to have it exactly as I want :D – PythonNewbie May 21 '20 at 14:52
  • 1
    Exactly. Once you get the hang of it things become much more fun/interesting. :) Good luck! – felipe May 21 '20 at 17:14
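The slow-site concern raised in the comments can also be addressed with asyncio.as_completed, which hands back each result as soon as it arrives instead of waiting for the whole batch (the delays here are simulated with asyncio.sleep; a real fetch would be an aiohttp GET):

```python
import asyncio


async def fetch(url, delay):
    # Simulated request; the real version would be an aiohttp GET.
    await asyncio.sleep(delay)
    return url


async def main():
    coros = [fetch("fast-1", 0.1), fetch("fast-2", 0.1), fetch("slow", 0.5)]
    tasks = [asyncio.create_task(c) for c in coros]
    finished = []
    # as_completed yields whichever result is ready next, so the fast
    # "sites" can be processed while the slow one is still pending.
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
    return finished


finished = asyncio.run(main())
print(finished)  # the slow site is always last; the fast ones did not wait for it
```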