So I've been scraping protected pages on a website (www.cardsphere.com) with requests, using a Session, like so:
import requests

payload = {
    'email': <enter-email-here>,
    'password': <enter-site-password-here>
}

with requests.Session() as session:
    # hit the login page first so the session picks up any initial cookies
    session.get(<site-login-page>)
    # log in once; the session keeps the auth cookies for everything below
    session.post(<site-login-here>, data=payload)
    session.get(<site-protected-page1>)
    save-stuff-from-page1
    session.get(<site-protected-page2>)
    save-stuff-from-page2
    .
    .
    .
    session.get(<site-protected-pageN>)
    save-stuff-from-pageN
Now, since there are quite a few pages, I wanted to speed things up with aiohttp + asyncio... but I'm missing something. I've been able to more or less use it to scrape unprotected pages, like so:
import asyncio
import aiohttp

async def get_cards(url):
    # note: this opens a brand-new ClientSession for every single URL
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            data = await resp.text()
            <do-stuff-with-data>

urls = [
    'https://www.<url1>.com',
    'https://www.<url2>.com',
    .
    .
    .
    'https://www.<urlN>.com'
]

loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(
        *(get_cards(url) for url in urls)
    )
)
That gave some results, but how do I do it for pages that require a login? I tried adding session.post(<login-url>, data=payload) inside the async function.
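Roughly, my attempt looked like this (same payload as above; <login-url> is the site's login endpoint):

async def get_cards(url):
    async with aiohttp.ClientSession() as session:
        # this runs once per URL, so it logs in again for every page
        await session.post(<login-url>, data=payload)
        async with session.get(url) as resp:
            data = await resp.text()
            <do-stuff-with-data>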
But that obviously didn't work out well: it just keeps logging in, once for every URL. Is there a way to "set up" an aiohttp ClientSession before the gather call, so that I log in first and then, on that same session, get data from a bunch of protected links with asyncio + aiohttp?
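To make it concrete, here's what I imagine it should look like, though this is just a guess on my part (same payload, urls, and placeholders as above):

async def get_cards(session, url):
    # reuse the one shared, already-logged-in session instead of opening a new one
    async with session.get(url) as resp:
        data = await resp.text()
        <do-stuff-with-data>

async def main():
    async with aiohttp.ClientSession() as session:
        # log in once; as far as I understand, the session's cookie jar
        # should hold on to the auth cookies for all the requests below
        async with session.post(<login-url>, data=payload) as resp:
            await resp.read()
        # then fetch all the protected pages concurrently on that same session
        await asyncio.gather(*(get_cards(session, url) for url in urls))

asyncio.run(main())

Is that actually how it's done, or does passing the session around like that break something?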
I'm still rather new to Python, and even newer to async, so I'm clearly missing some key concept here. If anybody could point me in the right direction, I'd greatly appreciate it.