
So I've been scraping protected pages of a website (www.cardsphere.com) with requests, using a Session, like so:

import requests

payload = {
    'email': <enter-email-here>,
    'password': <enter-site-password-here>
}

with requests.Session() as session:
    session.get(<site-login-page>)
    session.post(<site-login-here>, data=payload)
    session.get(<site-protected-page1>)
    # save stuff from page 1
    session.get(<site-protected-page2>)
    # save stuff from page 2
    # ...
    session.get(<site-protected-pageN>)
    # save stuff from page N

Now, since it's quite a few pages, I wanted to speed it up with aiohttp + asyncio... but I'm missing something. I've been able to more or less use it to scrape unprotected pages, like so:

import asyncio
import aiohttp

async def get_cards(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            data = await resp.text()
            # <do-stuff-with-data>

urls = [
    'https://www.<url1>.com',
    'https://www.<url2>.com',
    # ...
    'https://www.<urlN>.com',
]

loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(
        *(get_cards(url) for url in urls)
    )
)

That gave some results, but how do I do it for pages that require login? I tried adding session.post(<login-url>, data=payload) inside the async function, but that obviously didn't work out well; it just keeps logging in. Is there a way to "set up" an aiohttp ClientSession before the loop, so that I log in first and then, on the same session, get data from a bunch of protected links with asyncio + aiohttp?
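To make the "keeps logging in" part concrete, what I tried looks roughly like this (placeholders, not my exact code):

import asyncio
import aiohttp

payload = {'email': '<email>', 'password': '<password>'}

async def get_cards(url):
    # A new session per task means a new, empty cookie jar per task,
    # so every single task has to log in again before its request.
    async with aiohttp.ClientSession() as session:
        await session.post('<site-login-here>', data=payload)
        async with session.get(url) as resp:
            data = await resp.text()
            # <do-stuff-with-data>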

I'm still rather new to Python, and even newer to async, so I'm missing some key concept here. If anybody could point me in the right direction I'd greatly appreciate it.

Last_crusaider
  • How you authenticate your requests will depend on the authentication method the website uses. Does the website use Basic Auth, OAuth, is it token based, etc.? – Timothy Jannace Mar 19 '19 at 21:46
  • @Timothy Jannace: I access the website exactly as posted, I make a GET to the login page first, then a POST to the same login URL but with the payload containing my login information... so cookie based? Not much of an expert on this; edited to add the website. – Last_crusaider Mar 19 '19 at 22:16
  • Ah, I see. So after logging in, the auth cookie is saved to the session, which authenticates the subsequent requests. In that case the authentication call must be done first (and will block the other calls), and the subsequent calls can be done async. – Timothy Jannace Mar 19 '19 at 22:20
  • @Timothy Jannace Yeah, but how do I do that? With a simple request first, or some other way? I tried multiple combinations, like doing a post after `with aiohttp.ClientSession() as session:` before the async loop functions, but that didn't work. – Last_crusaider Mar 19 '19 at 23:45

1 Answer


This is the simplest I can come up with. Depending on what you do in <do-stuff-with-data> you may run into some other troubles regarding concurrency, and down the rabbit hole you go... just kidding, it's a little bit more complicated to wrap your head around coroutines, futures and tasks, but once you get it, it's as simple as sequential programming.

import asyncio
import aiohttp


async def get_cards(url, session, sem):
    # Reuse the shared, already-authenticated session; the semaphore caps
    # how many of these requests run at the same time.
    async with sem, session.get(url) as resp:
        data = await resp.text()
        # <do-stuff-with-data>


urls = [
    'https://www.<url1>.com',
    'https://www.<url2>.com',
    'https://www.<urlN>.com'
]


async def main():
    sem = asyncio.Semaphore(100)  # at most 100 requests in flight
    async with aiohttp.ClientSession() as session:
        # Log in once; the session's cookie jar then authenticates
        # every subsequent request made through it.
        await session.get('auth_url')
        await session.post('auth_url', data={'user': None, 'pass': None})
        tasks = [asyncio.create_task(get_cards(url, session, sem)) for url in urls]
        results = await asyncio.gather(*tasks)
        return results


asyncio.run(main())
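If the semaphore feels clunky, the concurrency cap can also be set at the connection level via aiohttp's TCPConnector. A rough, untested variant of the session setup (placeholder URL, limit=100 is an arbitrary example value):

import asyncio
import aiohttp

async def main():
    # Cap concurrent connections at the connector instead of
    # (or in addition to) the semaphore.
    connector = aiohttp.TCPConnector(limit=100)
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get('https://www.example.com') as resp:
            return await resp.text()

asyncio.run(main())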
Dalvenjia
  • That seems to be working, but it's taking about the same time (around 30 minutes for 35,000 URLs). What I do after I get the data from a URL is only fill a dict with some info from it. I'm also getting a `concurrent.futures._base.CancelledError` after `results = await asyncio.gather(*tasks)` – Last_crusaider Mar 19 '19 at 23:48
  • @Last_crusaider Posting a full backtrace for the cancelled error might be enlightening. Also, for such a large number of URLs, you probably want to add a semaphore that limits their number. E.g. add `sem = Semaphore(100)` to `main()`, then pass `sem` to `get_cards()`, and add `async with sem` around the existing code. – user4815162342 Mar 20 '19 at 13:37
  • For the task taking the same time, you will need to debug your script to find the bottleneck, most likely network congestion or the coroutines fighting for resources. There are no restraints on the amount of concurrent tasks, and it's not parallelization either, so all that context switching takes a toll; 35k is a lot. You may want to implement a semaphore to limit the amount of concurrent tasks. – Dalvenjia Mar 20 '19 at 15:03
  • Added a semaphore, but I haven't tested it, though. – Dalvenjia Mar 20 '19 at 15:11
  • Managed to do it by passing cookies around (roughly like the sketch below). Although I was crashing the target site's servers because of the load, the semaphore helped there. Thank you! – Last_crusaider Mar 20 '19 at 15:20
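For reference, the "passing cookies around" approach from the last comment can look roughly like this: log in once with requests, then hand the resulting cookies over to aiohttp's ClientSession. Untested sketch with placeholder URLs and credentials:

import asyncio
import aiohttp
import requests

payload = {'email': '<email>', 'password': '<password>'}

# Log in synchronously once; requests stores the auth cookies on the session.
login = requests.Session()
login.get('<site-login-page>')
login.post('<site-login-here>', data=payload)
cookies = login.cookies.get_dict()

async def get_cards(url, session, sem):
    async with sem, session.get(url) as resp:
        return await resp.text()

async def main(urls):
    sem = asyncio.Semaphore(100)
    # Hand the cookies from the requests login over to aiohttp.
    async with aiohttp.ClientSession(cookies=cookies) as session:
        return await asyncio.gather(*(get_cards(u, session, sem) for u in urls))

asyncio.run(main(['<site-protected-page1>', '<site-protected-page2>']))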