
I have a problem with web scraping in Python. I'm trying to get the data from the first table on https://www.nyse.com/ipo-center/filings using AsyncHTMLSession from requests_html.

My code is here:

from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

#first define the URL and start the session
url = 'http://www.nyse.com/ipo-center/filings'
session = AsyncHTMLSession()

#then get the URL content, and load the html content after rendering the javascript
r = await session.get(url)
await r.html.arender()

#then we create a beautifulsoup object based on the rendered html
soup = BeautifulSoup(r.html.html, "lxml")

#then we find the first datatable, which is the one that contains upcoming IPO data
table1 = soup.find('table', class_='table table-data table-condensed spacer-lg')

Now I have 2 problems with that:

  1. Oftentimes the website doesn't return any valid information in table1, so I don't get the data that's inside the table. So far I'm circumventing that by simply waiting a couple of seconds and running the loop again until the table is loaded. Probably not the best option though.
  2. The code works within Jupyter Notebook, but once I run it as a .py file on my server, I get SyntaxError: 'await' outside async function.

Does anybody have a solution to the 2 problems mentioned above?

1 Answer


Since you are using coroutines, you need to wrap them inside an async function and run them on the session's event loop. See the example below:

from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

#first define the URL and start the session
url = 'http://www.nyse.com/ipo-center/filings'
session = AsyncHTMLSession()

#then get the URL content, and render the javascript inside an async function
async def get_page():
    r = await session.get(url)
    await r.html.arender(timeout=20)
    # r.text holds the unrendered response; the rendered html lives in r.html.html
    return r.html.html

#session.run executes the coroutine on its own event loop and returns a list of results
data = session.run(get_page)

#then we create a beautifulsoup object based on the rendered html
soup = BeautifulSoup(data[0], "lxml")

#then we find the first data table, which is the one that contains upcoming IPO data
table1 = soup.find('table', class_='table table-data table-condensed spacer-lg')
print(table1)
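
For the first problem from the question (the table sometimes coming back empty), the cause is usually that the JavaScript hasn't finished populating the table when arender() returns. Here is a minimal sketch of a retry approach, continuing the script above; the sleep, timeout, and attempt values are assumptions you will likely need to tune:

async def get_table(max_attempts=5):
    r = await session.get(url)
    for attempt in range(max_attempts):
        # sleep gives the page extra time after the initial render; reload=True
        # (the default) means each attempt re-renders the page from scratch
        await r.html.arender(timeout=20, sleep=2)
        soup = BeautifulSoup(r.html.html, "lxml")
        table = soup.find('table', class_='table table-data table-condensed spacer-lg')
        # only accept the table once it actually contains cell data
        if table is not None and table.find('td') is not None:
            return table
    return None

table1 = session.run(get_table)[0]

This keeps the "wait a couple of seconds and try again" idea from the question, but stops as soon as the table has real content instead of sleeping for a fixed time.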
– msenior_
  • Thanks for the input! That script worked on my server running Ubuntu with the file as .py, but it did not work in Jupyter Notebook on my local machine. It raises the error "This event loop is already running". Do you have any idea why? – DominikCamargo Feb 04 '22 at 22:21
  • It's quite well known, but it has to do with how Jupyter is set up. There are a few questions on Stack Overflow that explain this if you do some digging. Best to use async in an IDE other than Jupyter. – Stackbeans Feb 08 '22 at 01:04
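
For the "This event loop is already running" error mentioned in the comments: Jupyter already runs its own asyncio event loop, so session.run() cannot start another one. A common workaround, assuming the third-party nest_asyncio package is installed, is to patch the running loop before calling session.run:

import nest_asyncio
nest_asyncio.apply()  # lets session.run() nest inside Jupyter's running event loop

data = session.run(get_page)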