Using Python and the Selenium library to get data from the OP.GG website

Question

I'd like to get some data from this website. More precisely, I'd like to get a champion's win rate by game length. The HTML elements containing this information are 'tspan' elements, but strangely my code scrapes some 'tspan' elements and not others.

The website: https://www.op.gg/champions/cassiopeia/mid/trends?region=global&tier=platinum_plus

Example tspan elements from this site:

<text x="60" y="31.440000000000026" class="recharts-text recharts-label" text-anchor="middle" style="fill: rgb(117, 133, 146); font-size: 11px; font-weight: bold;">
<tspan x="60" dy="0em">2nd</tspan>
</text>

<text width="60" height="131" x="39" y="42.75" stroke="none" fill="#666" font-size="11px" color="#9AA4AF" class="recharts-text recharts-cartesian-axis-tick-value" text-anchor="end">
<tspan x="39" dy="0.355em">54%</tspan>
</text>

Here's my code:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.op.gg/champions/cassiopeia/mid/trends?region=global&tier=platinum_plus")
results = driver.find_elements(By.TAG_NAME, "tspan")


for r in results:
    print(r.text)

OUT:

As you can see, I get the win rate (e.g. 56%) and the game length (e.g. 25 min, 30 min), but nothing about Cassiopeia's rank compared to other champions. I'd like to learn that she has the best win rate when the game lasts 25 minutes, the ninth best when it lasts 30 minutes, and so on. That rank should also be in 'tspan' elements, yet I don't get it. Can someone help me?

Andrej Kesely · Answer 1 · 2023-07-04T20:13:00.030

You can parse the data directly from the HTML page without Selenium (it is embedded as JSON inside the <script> element with the id __NEXT_DATA__):

import json
import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://www.op.gg/champions/cassiopeia/mid/trends?region=global&tier=platinum_plus'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

data = soup.select_one('#__NEXT_DATA__').text
data = json.loads(data)

df_win = pd.DataFrame(data['props']['pageProps']['data']['trends']['win'])
df_ban = pd.DataFrame(data['props']['pageProps']['data']['trends']['ban'])
df_pick = pd.DataFrame(data['props']['pageProps']['data']['trends']['pick'])

print('## WIN ##')
print(df_win)
print('## BAN ##')
print(df_ban)
print('## PICK ##')
print(df_pick)

Prints:

## WIN ##
  version    rate  rank                 created_at
0   13.13  0.5257     2  2023-07-01T22:11:45+09:00
1   13.12  0.5270     1  2023-06-26T22:11:42+09:00
2   13.11  0.5146     7  2023-06-12T22:11:37+09:00
3   13.10  0.5204     5  2023-05-30T22:11:35+09:00
4   13.09  0.5218     3  2023-05-16T22:11:39+09:00
5   13.08  0.5193     5  2023-05-01T22:11:38+09:00
6   13.07  0.5224     4  2023-04-17T22:11:37+09:00
7   13.06  0.5234     7  2023-04-03T22:11:35+09:00
8   13.05  0.5233     6  2023-03-20T22:11:34+09:00
9   13.04  0.5304     2  2023-03-09T22:11:23+09:00
## BAN ##
  version    rate  rank                 created_at
0   13.13  0.0217    23  2023-07-01T22:11:45+09:00
1   13.12  0.0170    25  2023-06-26T22:11:42+09:00
2   13.11  0.0213    24  2023-06-12T22:11:37+09:00
3   13.10  0.0268    22  2023-05-30T22:11:35+09:00
4   13.09  0.0381    17  2023-05-16T22:11:39+09:00
5   13.08  0.0367    19  2023-05-01T22:11:38+09:00
6   13.07  0.0324    20  2023-04-17T22:11:37+09:00
7   13.06  0.0306    19  2023-04-03T22:11:35+09:00
8   13.05  0.0271    18  2023-03-20T22:11:34+09:00
9   13.04  0.0311    18  2023-03-09T22:11:23+09:00
## PICK ##
  version    rate  rank                 created_at
0   13.13  0.0286    27  2023-07-01T22:11:45+09:00
1   13.12  0.0240    32  2023-06-26T22:11:42+09:00
2   13.11  0.0256    31  2023-06-12T22:11:37+09:00
3   13.10  0.0280    25  2023-05-30T22:11:35+09:00
4   13.09  0.0346    20  2023-05-16T22:11:39+09:00
5   13.08  0.0358    21  2023-05-01T22:11:38+09:00
6   13.07  0.0329    23  2023-04-17T22:11:37+09:00
7   13.06  0.0325    23  2023-04-03T22:11:35+09:00
8   13.05  0.0312    23  2023-03-20T22:11:34+09:00
9   13.04  0.0377    17  2023-03-09T22:11:23+09:00

EDIT: Added game length stats:

df_game_lengths = pd.DataFrame(data['props']['pageProps']['data']['game_lengths'])
print(df_game_lengths)

Prints:

   game_length      rate  average  rank
0            0  0.513963      0.5    19
1           25  0.549140      0.5     1
2           30  0.520647      0.5     9
3           35  0.507682      0.5    22
4           40  0.508130      0.5    25
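To turn these rows into the kind of ranking the question asks for, a minimal sketch follows. It assumes each `game_lengths` entry is a dict with the `game_length`, `rate` and `rank` keys shown above. The `ordinal` and `describe` helpers are hypothetical names, and the sample rows are copied from the printed output rather than fetched live.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def describe(rows):
    """Build one readable line per game-length bucket."""
    return [
        f"game_length={row['game_length']}: "
        f"win rate {row['rate']:.2%}, ranked {ordinal(row['rank'])}"
        for row in rows
    ]


# Sample rows copied from the output above (not fetched live).
rows = [
    {"game_length": 0, "rate": 0.513963, "average": 0.5, "rank": 19},
    {"game_length": 25, "rate": 0.549140, "average": 0.5, "rank": 1},
    {"game_length": 30, "rate": 0.520647, "average": 0.5, "rank": 9},
    {"game_length": 35, "rate": 0.507682, "average": 0.5, "rank": 22},
    {"game_length": 40, "rate": 0.508130, "average": 0.5, "rank": 25},
]

for line in describe(rows):
    print(line)
```

With the DataFrame from the answer, the same rows should be obtainable via `df_game_lengths.to_dict("records")`.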

Andrej, OP needs win rate data by game length whereas this is win rate by patch version — Zero, Jul 04 '23 at 20:04
How did you find out that the data is inside a script? When I wanted to see what the page looks like, I right-clicked and chose 'Inspect', then searched for a particular value, e.g. '2nd', and I only found this text inside 'tspan' elements. — Artur Szafraniak, Jul 09 '23 at 21:11
@ArturSzafraniak Press `Ctrl + U` to view the page source and you will see that the page the server sends is different from what the browser shows (the browser executes JavaScript, which modifies the page). — Andrej Kesely, Jul 09 '23 at 21:14
OK, I pressed Ctrl+U and I see that the page differs from what the browser shows, but is there an easy way to figure out that the win rates were in exactly this script, and that I needed `data['props']['pageProps']['data']['trends']['win']` to get them? For example, I get this HTML element with `next_data = soup.select_one('#__NEXT_DATA__')` and then use an HTML viewer to see what it looks like: https://codebeautify.org/htmlviewer But it gives me around thirteen thousand lines of code. — Artur Szafraniak, Jul 11 '23 at 09:47
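One way to locate a value inside a large JSON blob such as `__NEXT_DATA__` is to search the parsed structure recursively and print the key path to each match. The sketch below is only an assumption about how one might explore it, not a description of how the path in the answer was found. `find_paths` is a hypothetical helper, and the sample dict merely mimics the nesting used in the answer.

```python
def find_paths(obj, wanted_key, path=()):
    """Yield the key path to every occurrence of wanted_key in nested dicts/lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == wanted_key:
                yield path + (key,)
            # Keep descending: the key may also appear deeper down.
            yield from find_paths(value, wanted_key, path + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from find_paths(value, wanted_key, path + (index,))


# Tiny stand-in for the parsed JSON; the real structure is much larger.
sample = {
    "props": {
        "pageProps": {
            "data": {
                "trends": {"win": [{"version": "13.13", "rate": 0.5257, "rank": 2}]},
                "game_lengths": [{"game_length": 25, "rate": 0.54914, "rank": 1}],
            }
        }
    }
}

for found in find_paths(sample, "game_lengths"):
    print(found)
```

On the real page, the same function could be called on the `data` dict returned by `json.loads(...)`, searching for a key such as `"rank"` to see every place a rank is stored.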

undetected Selenium · Answer 2 · 2023-07-04T20:15:24.707

To print the text of the Win Rate section, you can use the following locator strategies:

Code block:

driver.get("https://www.op.gg/champions/cassiopeia/mid/trends?region=global&tier=platinum_plus")
print(driver.find_element(By.XPATH, "//h3[./strong[text()='Win Rate']]//following::div[1]//div[@class='rank']").text)
print(driver.find_element(By.XPATH, "//h3[./strong[text()='Win Rate']]//following::div[1]//div[@class='rate']").text)

Note: you have to add the following import:

from selenium.webdriver.common.by import By

Console Output:
```
2 nd / 58
52.56%
```
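If the numbers are needed rather than the display strings, they can be pulled out with a regular expression. This is a sketch that assumes the text keeps the shape shown in the console output above ("2 nd / 58" and "52.56%"); `parse_rank` and `parse_rate` are hypothetical helpers, not part of Selenium.

```python
import re


def parse_rank(text):
    """Split a string like '2 nd / 58' into (rank, total)."""
    match = re.fullmatch(r"\s*(\d+)\s*(?:st|nd|rd|th)?\s*/\s*(\d+)\s*", text)
    if match is None:
        raise ValueError(f"unexpected rank text: {text!r}")
    return int(match.group(1)), int(match.group(2))


def parse_rate(text):
    """Convert a string like '52.56%' into a fraction between 0 and 1."""
    return float(text.strip().rstrip("%")) / 100


print(parse_rank("2 nd / 58"))
print(parse_rate("52.56%"))
```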

Ideally, to extract the text from Win Rate, you should use WebDriverWait with visibility_of_element_located(), so that the element has been rendered before it is read. The locator strategies stay the same:

Code block:

driver.get("https://www.op.gg/champions/cassiopeia/mid/trends?region=global&tier=platinum_plus")
print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//h3[./strong[text()='Win Rate']]//following::div[1]//div[@class='rank']"))).text)
print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//h3[./strong[text()='Win Rate']]//following::div[1]//div[@class='rate']"))).text)

Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

You can find a relevant discussion in "How to retrieve the text of a WebElement using Selenium - Python".

References

Link to useful documentation:

get_attribute() method: gets the given attribute or property of the element.
text attribute: returns the text of the element.
Difference between text and innerHTML using Selenium
