
I'm trying to scrape data from https://quickfs.net/company/BABA:US using pyppeteer, without the website knowing that I'm scraping it.

So my first question is:

  1. Is it correct that if I use pyppeteer, the website won't detect that I'm scraping?

On the page linked above, there is a drop-down list at the top right with the items Overview, Income Statement, ..., Key Ratios.

I want to use pyppeteer to select, say, Key Ratios from the drop-down, then extract the Per-Share Items data, and from that the Book Value row.

In the last comment on a previous question I asked about this website, I was told that this drop-down "only trigger different ways to present the same data".

So my second and third questions are (maybe they are the same):

  2. Should I somehow simulate selecting Key Ratios using pyppeteer?

  3. How can I extract the data behind the Key Ratios option using pyppeteer, without the website knowing that someone is scraping it?

I used the questions below to write code for this, but my code only extracts data from the Overview page, which is the first one.

These are the questions I based the code on:

  1. How can I retrieve data from a web page with a loading screen?
  2. Scraping content using pyppeteer in association with asyncio

I also tried to learn how to use buttons from this article: Web Scraping with a Headless Browser: A Puppeteer Tutorial, but it uses Puppeteer (JavaScript) rather than pyppeteer for Python.

And this is the code I used:

import asyncio
import pyppeteer

async def main():
    # Launch a Chromium browser (Chrome can be used instead).
    browser = await pyppeteer.launch(headless=False)
    # Open a blank page.
    page = await browser.newPage()
    # Go to the requested page and let its dynamic code run.
    await page.goto("https://api.quickfs.net/stocks/BABA:US/ovr/Annual/")
    # Get the HTML content of the page.
    cont = await page.content()
    # Close the browser so it does not stay open after the run.
    await browser.close()
    return cont

# Run main() once, keep the result, and print the HTML.
ovr = asyncio.get_event_loop().run_until_complete(main())
print(ovr)

Thanks in advance

TaL

1 Answer


Question 1: Is it correct that using pyppeteer for scraping I won't be noticed (by the website) as doing scraping?

Simple Answer: Yes, for the most part. This website renders its content with JavaScript, so you need something like pyppeteer to render the page. pyppeteer also drives a real browser, so your traffic looks much like a regular user's and there is less chance of being detected.

Technical Answer: This takes more web scraping experience, but if you look at the requests the page makes, you can see that the website uses an API to fetch the data it renders. So it is more efficient to make the request to the API directly, with the appropriate method and headers to avoid being detected.

GET https://api.quickfs.net/stocks/BABA:US/ovr/Annual/

{"datasets":{"metadata":{"_id":{},"qfs_symbol":"NYSE:BABA","currency":"USD","fsCat":"normal","name":"Alibaba Group Holding Limited","gs3_version_at_metadata_update":20191106,"exchange":"NYSE","industry":"Retailing","symbol":"BABA","country":"US","price":215.7,"p_pretax_inc":"24.9","ps":"8.1","ev_ebit":"42.5","ev_fcf":"21.7","ev_s":"7.7","ev_ebitda":"37.2","pb":"4.7","mkt_cap":588375,"pe":"27.7","ev_pretax_inc":"23.6","ev":558430,"qfs_symbol_v2":"BABA:US","description":"","avg_vol_50d":19498671,"beta":1.8212,"betaLastUpdated":20200419,"share_turnover":"180","sector":"Consumer Discretionary","template_version":4,"gics":"25502020","template_type":"normal"},"ks":"\n\t\t          <div class=\"ksTblBg\">\n\t\t            <table class=\"ksTbl\">\n\t\t              <thead>\n\t\t                <tr>\n\t\t                  <th colspan=\"6\" style=\"text-align:center\">Key Statistics<\/th>\n\t\t                <\/tr>\n\t\t              <\/thead>\n\t\t              <tbody>\n\t\t                \n\t\t                        <tr>\n\t\t                          <td class=\"ksSectHead\" colspan=\"2\">Valuation Ratios<\/td>\n\t\t                          <td class=\"ksSectHead\" colspan=\"2\">10-Yr Median Returns<\/td>\n\t\t                          <td class=\"ksSectHead\" colspan=\"2\">10-Yr Median Margins<\/td>\n\t\t                        <\/tr>\n\t\t                        <tr>\n\t\t                            <td class='lt'>P\/E<\/td><td class='rt' id='ks-pe'><\/td>\n\t\t                            <td class='lt'>ROA<\/td><td class='rt'>13.0%<\/td>\n\t\t                            <td class='lt'>Gross Profit<\/td><td class='rt'>66.7%<\/td>\n\t\t                        <\/tr>\n\t\t                        <tr>\n\t\t                            <td class='lt'>P\/B<\/td><td class='rt' id='ks-pb'><\/td>\n\t\t                            <td class='lt'>ROE<\/td><td class='rt'>22.3%<\/td>\n\t\t                            <td class='lt'>EBIT<\/td><td class='rt'>28.9%<\/td>\n\t\t        
                <\/tr>\n\t\t                        <tr>\n\t\t                            <td class='lt'>P\/S<\/td><td class='rt' id='ks-ps'><\/td>\n\t\t                            <td class='lt'>ROIC<\/td><td class='rt'>30.4%<\/td>\n\t\t                            <td class='lt'>Pre-Tax Income<\/td><td class='rt'>35.6%<\/td>\n\t\t                        <\/tr>\n\t\t                        <tr>\n\t\t                            <td class='lt'>EV\/S<\/td><td class='rt' id='ks-ev_s'><\/td>\n\t\t                            <td class='ksSectHead' colspan='2'>10-Year CAGR<\/td>\n\t\t                            <td class='lt'>FCF<\/td><td class='rt'>40.8%<\/td>\n\t\t                        <\/tr>\n\t\t                        <tr>\n\t\t                            <td class='lt'>EV\/EBITDA<\/td><td class='rt' id='ks-ev_ebitda'><\/td>\n\t\t                            <td class='lt'>Revenue<\/td><td class='rt'>56.3%<\/td>\n\t\t                            <td class='ksSectHead' colspan='2'>Capital Structure<\/td>\n\t\t                        <\/tr>\n\t\t                        <tr>\n\t\t                            <td class='lt'>EV\/EBIT<\/td><td class='rt' id='ks-ev_ebit'><\/td>\n\t\t                            <td class='lt'>Assets<\/td><td class='rt'>58.2%<\/td>\n\t\t                            <td class='lt'>Assets \/ Equity<\/td><td class='rt'>1.6<\/td>\n\t\t                        <\/tr>\n\t\t                        <tr>\n\t\t                            <td class='lt'>EV\/Pretax<\/td><td class='rt' id='ks-ev_pretax_income'><\/td>\n\t\t                            <td class='lt'>FCF<\/td><td class='rt'>51.1%<\/td>\n\t\t                            <td class='lt'>Debt \/ Equity<\/td><td class='rt'>0.3<\/td>\n\t\t                        <\/tr>\n\t\t                        <tr>\n\t\t                            <td class='lt'>EV\/FCF<\/td><td class='rt' id='ks-ev_fcf'><\/td>\n\t\t                            <td class='lt'>EPS<\/td><td class='rt'>68.6%<\/td>\n\t\t                
            <td class='lt'>Debt \/ Assets<\/td><td class='rt'>0.2<\/td>\n\t\t                        <\/tr>\n\t\t                    \n\t\t              <\/tbody>\n\t\t            <\/table>\n\t\t          <\/div>","ovr":"<table class='fs-table' id='ovr-table'>\n                    <tbody>\n                        <tr class='thead'><td><\/td><td>2011<\/td><td>2012<\/td><td>2013<\/td><td>2014<\/td><td>2015<\/td><td>2016<\/td><td>2017<\/td><td>2018<\/td><td>2019<\/td><td>2020<\/td><\/tr><tr class=' '><td class='labelCell'>Revenue<\/td><td class='dataCell' data-type='normal' data-value='1010821000'>1,011<\/td><td class='dataCell' data-type='normal' data-value='3172277000'>3,172<\/td><td class='dataCell' data-type='normal' data-value='5553464000'>5,553<\/td><td class='dataCell' data-type='normal' data-value='8505565000'>8,506<\/td><td class='dataCell' data-type='normal' data-value='12214920000'>12,215<\/td><td class='dataCell' data-type='normal' data-value='15554001000'>15,554<\/td><td class='dataCell' data-type='normal' data-value='22958079000'>22,958<\/td><td class='dataCell' data-type='normal' data-value='39615348000'>39,615<\/td><td class='dataCell' data-type='normal' data-value='56145652000'>56,146<\/td><td class='dataCell' data-type='normal' data-value='72603233000'>72,603<\/td><\/tr><tr class=' '><td class='labelCell italic indent'>Revenue Growth<\/td><td class='dataCell italic' data-type='percentage' data-value='0.20945600737049'>20.9%<\/td><td class='dataCell italic' data-type='percentage' data-value='2.1383172688339'>213.8%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.75062392092494'>75.1%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.53157830860162'>53.2%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.43610918263513'>43.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.27336085705023'>27.3%<\/td><td class='dataCell italic' data-type='percentage' 
data-value='0.47602401465706'>47.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.72555151500263'>72.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.41727019537983'>41.7%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.29312298305842'>29.3%<\/td><\/tr><tr class=' '><td class='labelCell'>Gross Profit<\/td><td class='dataCell' data-type='normal' data-value='812343000'>812<\/td><td class='dataCell' data-type='normal' data-value='2134020000'>2,134<\/td><td class='dataCell' data-type='normal' data-value='3989767000'>3,990<\/td><td class='dataCell' data-type='normal' data-value='6339808000'>6,340<\/td><td class='dataCell' data-type='normal' data-value='8394512000'>8,395<\/td><td class='dataCell' data-type='normal' data-value='10270811000'>10,271<\/td><td class='dataCell' data-type='normal' data-value='14329852000'>14,330<\/td><td class='dataCell' data-type='normal' data-value='22671036000'>22,671<\/td><td class='dataCell' data-type='normal' data-value='25315484000'>25,315<\/td><td class='dataCell' data-type='normal' data-value='32382879000'>32,383<\/td><\/tr><tr class=' '><td class='labelCell italic indent'>Gross Margin %<\/td><td class='dataCell italic' data-type='percentage' data-value='0.80364673864116'>80.4%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.67270922432057'>67.3%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.71842853397447'>71.8%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.74537176542652'>74.5%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.68723430034744'>68.7%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.66033241221985'>66.0%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.62417469684637'>62.4%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.57227910758224'>57.2%<\/td><td class='dataCell italic' data-type='percentage' 
data-value='0.45088948294696'>45.1%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.44602530303299'>44.6%<\/td><\/tr><tr class=' '><td class='labelCell'>Operating Profit<\/td><td class='dataCell' data-type='normal' data-value='266271000'>266<\/td><td class='dataCell' data-type='normal' data-value='847525000'>848<\/td><td class='dataCell' data-type='normal' data-value='1820317000'>1,820<\/td><td class='dataCell' data-type='normal' data-value='4084952000'>4,085<\/td><td class='dataCell' data-type='normal' data-value='3736415000'>3,736<\/td><td class='dataCell' data-type='normal' data-value='4607009000'>4,607<\/td><td class='dataCell' data-type='normal' data-value='7035973000'>7,036<\/td><td class='dataCell' data-type='normal' data-value='11137968000'>11,138<\/td><td class='dataCell' data-type='normal' data-value='8604121000'>8,604<\/td><td class='dataCell' data-type='normal' data-value='13105334000'>13,105<\/td><\/tr><tr class=' '><td class='labelCell italic indent'>Operating Margin %<\/td><td class='dataCell italic' data-type='percentage' data-value='0.26342052648293'>26.3%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.267166139653'>26.7%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.32778046278863'>32.8%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.48026815384986'>48.0%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.30588943685264'>30.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.29619446469111'>29.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.30647045861285'>30.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.28115285015293'>28.1%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.15324643482633'>15.3%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.18050620417964'>18.1%<\/td><\/tr><tr class=' '><td class='labelCell'>Earnings Per 
Share<\/td><td class='dataCell' data-type='eps' data-value='0.053'>$0.05<\/td><td class='dataCell' data-type='eps' data-value='0.287'>$0.29<\/td><td class='dataCell' data-type='eps' data-value='0.574'>$0.57<\/td><td class='dataCell' data-type='eps' data-value='1.62'>$1.62<\/td><td class='dataCell' data-type='eps' data-value='1.555'>$1.56<\/td><td class='dataCell' data-type='eps' data-value='4.289'>$4.29<\/td><td class='dataCell' data-type='eps' data-value='2.462'>$2.46<\/td><td class='dataCell' data-type='eps' data-value='3.88'>$3.88<\/td><td class='dataCell' data-type='eps' data-value='4.973'>$4.97<\/td><td class='dataCell' data-type='eps' data-value='7.965'>$7.97<\/td><\/tr><tr class=' '><td class='labelCell italic indent'>EPS Growth<\/td><td class='dataCell italic' data-type='percentage' data-value='0.23255813953488'>23.3%<\/td><td class='dataCell italic' data-type='percentage' data-value='4.4150943396226'>441.5%<\/td><td class='dataCell italic' data-type='percentage' data-value='1'>100.0%<\/td><td class='dataCell italic' data-type='percentage' data-value='1.8222996515679'>182.2%<\/td><td class='dataCell italic' data-type='percentage' data-value='-0.040123456790124'>-4.0%<\/td><td class='dataCell italic' data-type='percentage' data-value='1.7581993569132'>175.8%<\/td><td class='dataCell italic' data-type='percentage' data-value='-0.42597342037771'>-42.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.57595450852965'>57.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.28170103092784'>28.2%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.60164890408204'>60.2%<\/td><\/tr><tr class=' '><td class='labelCell'>Return on Assets<\/td><td class='dataCell' data-type='percentage' data-value='0.12490081137912'>12.5%<\/td><td class='dataCell' data-type='percentage' data-value='0.13547055438638'>13.5%<\/td><td class='dataCell' data-type='percentage' data-value='0.15474766176667'>15.5%<\/td><td 
class='dataCell' data-type='percentage' data-value='0.26661125490522'>26.7%<\/td><td class='dataCell' data-type='percentage' data-value='0.13179228026259'>13.2%<\/td><td class='dataCell' data-type='percentage' data-value='0.22667998521059'>22.7%<\/td><td class='dataCell' data-type='percentage' data-value='0.097819038194011'>9.8%<\/td><td class='dataCell' data-type='percentage' data-value='0.10848994208585'>10.8%<\/td><td class='dataCell' data-type='percentage' data-value='0.10177987302833'>10.2%<\/td><td class='dataCell' data-type='percentage' data-value='0.12868657986734'>12.9%<\/td><\/tr><tr class=' '><td class='labelCell'>Return on Equity<\/td><td class='dataCell' data-type='percentage' data-value='0.26227533616942'>26.2%<\/td><td class='dataCell' data-type='percentage' data-value='0.20200237445123'>20.2%<\/td><td class='dataCell' data-type='percentage' data-value='0.38077278024081'>38.1%<\/td><td class='dataCell' data-type='percentage' data-value='0.90392646328004'>90.4%<\/td><td class='dataCell' data-type='percentage' data-value='0.24438521190488'>24.4%<\/td><td class='dataCell' data-type='percentage' data-value='0.34553804941983'>34.6%<\/td><td class='dataCell' data-type='percentage' data-value='0.14914185483796'>14.9%<\/td><td class='dataCell' data-type='percentage' data-value='0.17542701201745'>17.5%<\/td><td class='dataCell' data-type='percentage' data-value='0.16392435911507'>16.4%<\/td><td class='dataCell' data-type='percentage' data-value='0.19830371476362'>19.8%<\/td><\/tr><tr class=' '><td class='labelCell'>Return on Invested Capital<\/td><td class='dataCell' data-type='percentage' data-value='0.41743100812616'>41.7%<\/td><td class='dataCell' data-type='percentage' data-value='0.31146385668929'>31.1%<\/td><td class='dataCell' data-type='percentage' data-value='0.56166392937543'>56.2%<\/td><td class='dataCell' data-type='percentage' data-value='0.79357545168436'>79.4%<\/td><td class='dataCell' data-type='percentage' 
data-value='0.29563665163366'>29.6%<\/td><td class='dataCell' data-type='percentage' data-value='0.40666624726852'>40.7%<\/td><td class='dataCell' data-type='percentage' data-value='0.15645567128128'>15.6%<\/td><td class='dataCell' data-type='percentage' data-value='0.17835726067885'>17.8%<\/td><td class='dataCell' data-type='percentage' data-value='0.15560704472355'>15.6%<\/td><td class='dataCell' data-type='percentage' data-value='0.20185124127701'>20.2%<\/td><\/tr><\/tbody><\/table>","chart":[["2006-12",0],["2007-12",-2.7333985391131],["2008-12",1.4806594382205],["2009-12",0.44823138109063],["2010-12",0.57515717254689],["2011-12",0.41743100812616],["2012-03",0.31146385668929],["2013-03",0.56166392937543],["2014-03",0.79357545168436],["2015-03",0.29563665163366],["2016-03",0.40666624726852],["2017-03",0.15645567128128],["2018-03",0.17835726067885],["2019-03",0.15560704472355],["2020-03",0.20185124127701]]},"errors":[],"code":0,"qfs_symbol_v2":"BABA:US","statementPeriod":"Annual"}
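The "ovr" field in the response above is an HTML table, so it still needs parsing after the JSON is decoded. A minimal sketch of one way to do that with only the standard library is below. The names FsTableParser and parse_fs_table are made up for illustration, and the code assumes the labelCell / dataCell / data-value markup stays as shown in the response.

```python
from html.parser import HTMLParser


class FsTableParser(HTMLParser):
    """Collect {row label: [numeric values]} from a table like the "ovr" field."""

    def __init__(self):
        super().__init__()
        self.rows = {}
        self._label = None      # label of the row currently being read
        self._in_label = False  # True while inside a labelCell <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._label = None  # each row starts without a label
            return
        if tag != "td":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "labelCell" in classes:
            self._in_label = True
            self._label = ""
        elif "dataCell" in classes and self._label is not None:
            raw = attrs.get("data-value")
            if raw is not None:
                # data-value holds the unrounded number behind the displayed text
                self.rows.setdefault(self._label, []).append(float(raw))

    def handle_data(self, data):
        if self._in_label:
            self._label += data

    def handle_endtag(self, tag):
        if tag == "td" and self._in_label:
            self._in_label = False
            self._label = self._label.strip()


def parse_fs_table(html):
    """Return a dict mapping each row label to its list of values."""
    parser = FsTableParser()
    parser.feed(html)
    return parser.rows
```

With the decoded response in a dict called data, parse_fs_table(data["datasets"]["ovr"])["Revenue"] would give the revenue figures in column order.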

Question 2: Should I somehow simulate the Key Ratios being selected using the pyppeteer?

Simple Answer: pyppeteer uses CSS selectors to find elements on the page. To open that dropdown menu, you need the selector path that leads to its element. You can use something like Chrome DevTools (F12): right-click the element and copy its CSS selector. Then, to click the dropdown with pyppeteer:

# click the dropdown button that contains Key Ratios
# (page.select() only works on <select> elements, so use page.click() here)
await page.click('body > app-root > app-company > div > div > div.pageHead > div > div:nth-child(3) > div.col-xs-offset-3.col-xs-2 > select-fs-dropdown > div > button > div')

The pyppeteer documentation should give you a better idea of how to do the rest.
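As a rough sketch of the whole click sequence: wait for the dropdown to render, click it, then click the entry whose text is Key Ratios. Both selectors below are assumptions (a shortened form of the DevTools path and a guessed XPath for the menu entry), so check them against the live page. The page must be loaded from https://quickfs.net/company/BABA:US, not from the API URL, or the dropdown will not exist.

```python
import asyncio

# Assumption: shortened form of the selector copied from DevTools.
DROPDOWN_BUTTON = "select-fs-dropdown button"
# Assumption: the menu entry is an element inside the dropdown whose text is "Key Ratios".
KEY_RATIOS_ITEM = '//select-fs-dropdown//*[normalize-space(text())="Key Ratios"]'


async def open_key_ratios(page, timeout_ms=15000):
    """Open the dropdown on an already-loaded company page and pick Key Ratios."""
    # The page is rendered by JavaScript, so wait until the button exists.
    await page.waitForSelector(DROPDOWN_BUTTON, {"timeout": timeout_ms})
    await page.click(DROPDOWN_BUTTON)
    # Wait for the menu entry, then click the first match.
    await page.waitForXPath(KEY_RATIOS_ITEM, {"timeout": timeout_ms})
    items = await page.xpath(KEY_RATIOS_ITEM)
    if not items:
        raise RuntimeError("Key Ratios entry not found; check the XPath")
    await items[0].click()


async def main():
    import pyppeteer  # third-party; imported here so the helper above has no hard dependency

    browser = await pyppeteer.launch(headless=False)
    try:
        page = await browser.newPage()
        await page.goto("https://quickfs.net/company/BABA:US")
        await open_key_ratios(page)
        return await page.content()
    finally:
        await browser.close()

# To run it: html = asyncio.run(main())
```

Waiting for the selector before clicking is what avoids the "failed to find element" style of error when the page has not finished rendering.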

Question 3: How to extract the data from the Key Ratios trigger, using pyppeteer, without the website knowing that someone is scraping it?

Short Answer: You can grab the table using selectors similar to the one in the answer to question 2, then parse the table.

Technical Answer: With a better understanding of how the website works, you can do this more directly. Using something like Chrome DevTools, you can see that the page calls out to an API. The API returns all the data you need in JSON format, which is easy to parse. Using the API is straightforward: just change the stock ticker.

# get data for Alibaba
https://api.quickfs.net/stocks/BABA:US/ovr/Annual/

# get data for Tesla
https://api.quickfs.net/stocks/TSLA:US/ovr/Annual/

# get data for Apple
https://api.quickfs.net/stocks/AAPL:US/ovr/Annual/

Then you can simply call the API in Python with requests:

import requests
resp = requests.get("https://api.quickfs.net/stocks/AAPL:US/ovr/Annual/")
data = resp.json()  # json is a method, so it has to be called
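To avoid retyping the URL for each ticker, the pattern can be wrapped in a small helper. This is only a sketch: overview_url and fetch_overview are made-up names, the session argument is assumed to be a requests.Session (or anything with a compatible get method), and "Annual" is the only period value that appears in the URLs above.

```python
from urllib.parse import quote

API_ROOT = "https://api.quickfs.net/stocks"


def overview_url(symbol, period="Annual"):
    """Build the overview URL for a symbol such as "BABA:US"."""
    # Keep the colon as-is, matching the URLs shown above.
    return f"{API_ROOT}/{quote(symbol, safe=':')}/ovr/{period}/"


def fetch_overview(symbol, session, period="Annual", timeout=10):
    """Fetch and decode the overview JSON for one symbol."""
    resp = session.get(overview_url(symbol, period), timeout=timeout)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    return resp.json()
```

For example, fetch_overview("AAPL:US", requests.Session()) would return the decoded dict for Apple.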
TuanGeek
  • Thank you for the detailed answers! If it's OK with you, I would like to ask about your answer to question 2. I tried to run the code you wrote but got the error "pyppeteer.errors.ElementHandleError: Error: failed to find element matching selector : (here is what is in the brackets in the answer code)". How did you know what should be written in the selector? (I read the documentation but can't figure out what should go in it.) My second question is about the code you wrote for question 3. It seems to use the requests package rather than pyppeteer. Is there a way to do the same with pyppeteer? – TaL Jul 06 '20 at 12:01