Selenium scraping JS loaded pages

Question

I'm trying to scrape some of the JS-loaded data from https://surviv.io/stats/player787, such as the total number of kills. Could someone tell me how I can scrape JS-loaded data with Selenium? Thanks.

EDIT: Here is some of the code

from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://surviv.io/stats/player787')
b = browser.find_element_by_tag_name('tr')

The 'tr' which contains the data that I want is not grabbed by selenium

`The 'tr' which contains the data that i want is not grabbed by selenium` - which data? There are multiple tags in HTML — Andrei Suvorkov, Dec 23 '19 at 14:18
@AaravM4: There are lots of tr tags; you need to mention in your post which table data you are after. — KunduK, Dec 23 '19 at 14:18
It is the first 'tr' that there is in the code. Here is the tr: https://i.stack.imgur.com/8rY4b.png — AaravM4, Dec 23 '19 at 14:19

chitown88 · Answer 1 · 2019-12-23T14:38:34.790

The element isn't found because the page hasn't finished rendering when the lookup runs. You can add an explicit wait so that Selenium does not move on until the specified element is present.
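Conceptually, a wait of this kind is a polling loop with a deadline: try the lookup, and if the element is not there yet, sleep briefly and try again until time runs out. The sketch below is not Selenium's own code; `wait_until`, `WaitTimeout`, and `find_row` are hypothetical names, and `LookupError` stands in for Selenium's "element not found" exception so the idea can run without a browser.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=3.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except LookupError:
            # Stand-in for "element not rendered yet": swallow and keep polling.
            pass
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll)


# Simulated page: the "element" only exists from the third lookup onwards.
attempts = {"count": 0}


def find_row():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("tr not rendered yet")
    return "<tr>"


row = wait_until(find_row, timeout=2.0, poll=0.01)
print(row, attempts["count"])
```

In real code you would not write this loop yourself; Selenium's WebDriverWait plays the role of `wait_until`, and an expected condition plays the role of `find_row`.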

Also, if the data is in a <table> tag, let pandas do the parsing for you. Once you have the rendered HTML source, pandas.read_html pulls out the <table>, <th>, <tr>, and <td> tags (it uses lxml or BeautifulSoup under the hood) and returns the tables as a list of DataFrames:

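from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Selenium 3 style driver path; newer Selenium 4 releases expect service=Service(path) instead.
browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
browser.get('https://surviv.io/stats/player787')

# Block until the stats table is in the DOM; raises TimeoutException after `delay` seconds.
delay = 3  # seconds
WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'player-stats-overview')))

# read_html returns one DataFrame per <table>; [0] takes the first one.
# Wrapping the source in StringIO avoids the literal-HTML deprecation warning in recent pandas.
df = pd.read_html(StringIO(browser.page_source))[0]

print(df.loc[0, 'Kills'])

browser.close()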

Output:

18884


print (df)
   Wins  Kills  Games  K/G
0   638  18884   8896  2.1

score 2 · Answer 2 · edited Dec 23 '19 at 14:33

To get the kill count, use WebDriverWait together with visibility_of_all_elements_located:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://surviv.io/stats/player787')
allkills = WebDriverWait(browser,20).until(EC.visibility_of_all_elements_located((By.XPATH,"//div[@class='card-mode-stat-name' and text()='KILLS']/following-sibling::div[1]")))
for item in allkills:
    print(item.text)

QHarr · Answer 3 · 2019-12-24T18:25:42.080

1

You could avoid the overhead of a browser and simply mimic the POST request the page makes.

import requests

headers = {'content-type': 'application/json; charset=UTF-8'}
data = {"slug":"player787","interval":"all","mapIdFilter":"-1"}
r = requests.post('https://surviv.io/api/user_stats', headers=headers, json=data)
data = r.json()
desired_stats = ['wins', 'kills', 'games', 'kpg'] 
for stat in desired_stats:
    print(stat, ': ' , data[stat])

For OP: the payload is visible in the network tab when you click on the XHR request whose URL appears in my answer (you need to scroll down to see the payload info).
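The request above can also be wrapped in small helpers that add a timeout and a status check. This is a minimal sketch, not the site's documented API: the endpoint, payload keys, and stat names come from the answer, while `build_payload`, `pick_stats`, `fetch_stats`, and the sample dict (which reuses the numbers from the first answer's table plus a made-up extra field) are assumptions for illustration.

```python
API_URL = "https://surviv.io/api/user_stats"  # endpoint taken from the answer above


def build_payload(slug, interval="all", map_id_filter="-1"):
    """Return the JSON body the stats page sends for one player."""
    return {"slug": slug, "interval": interval, "mapIdFilter": map_id_filter}


def pick_stats(response_json, wanted=("wins", "kills", "games", "kpg")):
    """Keep only the wanted keys; a missing key maps to None instead of raising."""
    return {key: response_json.get(key) for key in wanted}


def fetch_stats(slug, timeout=10):
    """POST the payload and return the selected stats (needs the requests package)."""
    import requests  # imported here so the helpers above work without it installed

    # json= serialises the dict and sets the JSON content-type header for us.
    response = requests.post(API_URL, json=build_payload(slug), timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body
    return pick_stats(response.json())


# Offline check with a hand-written sample shaped like the fields the answer reads.
sample = {"wins": 638, "kills": 18884, "games": 8896, "kpg": 2.1, "other_field": "ignored"}
print(pick_stats(sample))
```

Calling `fetch_stats("player787")` performs the real request; it is left out of the snippet so the helpers can be tried without network access.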

edited Dec 24 '19 at 18:25

answered Dec 23 '19 at 23:08


I don't clearly understand; could you elaborate a bit, @QHarr? – AaravM4 Dec 24 '19 at 18:16
Using a browser is slow. If you run the browserless code above you will get stats as json. It is the same request, simplified, the page is making when it runs javascript in the browser. – QHarr Dec 24 '19 at 18:17
which items specifically do you want? – QHarr Dec 24 '19 at 18:20
How were you able to understand how to get this? – AaravM4 Dec 24 '19 at 18:20
I monitored the network traffic from the browser in the network pane of dev tools F12 and saw the request the page made. – QHarr Dec 24 '19 at 18:21
I'm sorry if I'm a bit dumb, but how do you monitor the network traffic? – AaravM4 Dec 24 '19 at 18:22
See https://stackoverflow.com/a/56279841/6241235 and https://stackoverflow.com/a/56924071/6241235 – QHarr Dec 24 '19 at 18:23
How were you able to understand what JSON data was passed, though? – AaravM4 Dec 24 '19 at 18:25
I have updated answer. I could see the json listed in the network tab. – QHarr Dec 24 '19 at 18:26
Thanks a lot! I finally understand! – AaravM4 Dec 24 '19 at 18:30

score 0 · Answer 4 · answered Dec 23 '19 at 20:34

To scrape the values 652, 19152, 8926, 2.1, etc. from JS-loaded pages, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

Using CSS_SELECTOR:

driver.get('https://surviv.io/stats/player787')
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.player-stats-overview td")))])

Using XPATH:

driver.get('https://surviv.io/stats/player787')
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='player-stats-overview']//td")))])

Console Output:
```
['652', '19152', '8926', '2.1']
```

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
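The list printed above holds strings in table order, so a small post-processing step can attach labels and convert them to numbers. This is a sketch under one assumption: that the cells appear in the order Wins, Kills, Games, K/G, as in the DataFrame shown in the first answer. `to_number` and `label_cells` are hypothetical helper names.

```python
# Assumed column order, copied from the DataFrame header in the first answer.
HEADERS = ["Wins", "Kills", "Games", "K/G"]


def to_number(text):
    """Convert '19152' to an int and '2.1' to a float, ignoring surrounding whitespace."""
    cleaned = text.strip()
    try:
        return int(cleaned)
    except ValueError:
        return float(cleaned)


def label_cells(cell_texts, headers=HEADERS):
    """Pair each cell with its header; refuse silently misaligned input."""
    if len(cell_texts) != len(headers):
        raise ValueError(f"expected {len(headers)} cells, got {len(cell_texts)}")
    return {header: to_number(text) for header, text in zip(headers, cell_texts)}


stats = label_cells(['652', '19152', '8926', '2.1'])
print(stats["Kills"])
```

If the page ever adds or reorders columns, the length check catches the first case but not the second, so reading the header cells from the page as well would be the more robust choice.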
