
I have Python code to read an XPath from a website (https://www.op.gg/summoners/kr/Hide%20on%20bush):

import requests
import lxml.html as html
import pandas as pd

url_padre = "https://www.op.gg/summoners/br/tercermundista"

link_farm = '//div[@class="stats"]//div[@class="cs"]'

r = requests.get(url_padre)

home = r.content.decode("utf-8")

parser = html.fromstring(home)
farm = parser.xpath(link_farm)

print(farm)

This code prints "[]".

But when I put this XPath in the Chrome console: $x('//div[@class="stats"]//div[@class="cs"]').map(x=>x.innerText), it prints the numbers I want. My Python code doesn't. What is the mistake?

I want code that fixes my mistake.

--------------------------edit---------------------------


Error                                     Traceback (most recent call last)
c:\Users\GCO\Desktop\Analisis de datos\borradores\fsdfs.ipynb Cell 2 in 3
      1 from playwright.sync_api import sync_playwright
----> 3 with sync_playwright() as p, p.chromium.launch() as browser:
      4     page = browser.new_page()
      5     page.goto("https://www.op.gg/summoners/kr/Hide%20on%20bush", timeout=10000)

File c:\Users\GCO\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\sync_api\_context_manager.py:47, in PlaywrightContextManager.__enter__(self)
     45             self._own_loop = True
     46         if self._loop.is_running():
---> 47             raise Error(
     48                 """It looks like you are using Playwright Sync API inside the asyncio loop.
     49 Please use the Async API instead."""
     50             )
     52         # In Python 3.7, asyncio.Process.wait() hangs because it does not use ThreadedChildWatcher
     53         # which is used in Python 3.8+. This is unix specific and also takes care about
     54         # cleaning up zombie processes. See https://bugs.python.org/issue35621
     55         if (
     56             sys.version_info[0] == 3
     57             and sys.version_info[1] == 7
     58             and sys.platform != "win32"
     59             and isinstance(asyncio.get_child_watcher(), asyncio.SafeChildWatcher)
     60         ):

Error: It looks like you are using Playwright Sync API inside the asyncio loop.
Please use the Async API instead.
  • please add the contents of `print(home)` to your question. I believe the request itself is not succeeding – pL3b Mar 09 '23 at 17:19
  • `Request blocked.` says the page. CloudFront might be protecting itself from automation. It could be possible to work around that, but I believe it's not easy – LMC Mar 09 '23 at 17:31
  • There is a workaround, but I still cannot get elements using your xpath: https://stackoverflow.com/a/70370164/17200418 – pL3b Mar 09 '23 at 17:36
  • I think the contents you are trying to get are dynamically loaded, so maybe you need a more advanced parsing tool. You can try `selenium` or `playwright`. – pL3b Mar 09 '23 at 17:43
  • I can try it with this page too, "https://u.gg/lol/profile/kr/hide%20on%20bush/overview", changing the code to ('//div[@class="post-stats"]//div[@class="cs"]'); r.status_code is 200, but the answer is the same... it prints "[]" – Benjamin Correa Mar 09 '23 at 17:46
  • Actually this u.gg site works the same way. Have a look at my answer below. Using `playwright` is much slower, but I believe it will fit you better, especially if you want to grab more data by scrolling the page down. – pL3b Mar 09 '23 at 18:15
  • ok bro! I will try, thank you so much for your answer; now I'm going to try to use it in my code – Benjamin Correa Mar 09 '23 at 20:16

2 Answers


As I understand it, you cannot get dynamically generated content using requests.

Here is a solution using playwright, which can load the whole page before parsing.

  1. Install playwright using pip install playwright
  2. Install browser and dependencies using playwright install chromium --with-deps
  3. Run the following code:
from playwright.sync_api import sync_playwright

with sync_playwright() as p, p.chromium.launch() as browser:
    page = browser.new_page()
    page.goto("https://www.op.gg/summoners/kr/Hide%20on%20bush", timeout=10000)
    selector = "//div[@class='stats']//div[@class='cs']/div"
    cs_stats = page.query_selector_all(selector)
    print(len(cs_stats), [cs.inner_text() for cs in cs_stats])

If you want to stick with lxml as the parsing tool, you can use the following code:

from lxml import html
from playwright.sync_api import sync_playwright

with sync_playwright() as p, p.chromium.launch() as browser:
    page = browser.new_page()
    page.goto("https://www.op.gg/summoners/kr/Hide%20on%20bush", timeout=10000)
    selector = "//div[@class='stats']//div[@class='cs']/div"
    c = page.content()
    parser = html.fromstring(c)
    farm = parser.xpath(selector)
    print(len(farm), [cs.text for cs in farm])

P.S.

Also, I have noticed that op.gg uses pretty simple HTTP requests that do not need authorization. You can find the desired info using this code:

import json
from urllib.request import urlopen
url = "https://op.gg/api/v1.0/internal/bypass/games/kr/summoners/4b4tvMrpRRDLvXAiQ_Vmh5yMOsD0R3GPGTUVfIanp1Httg?&limit=20"
r = urlopen(url)
games = json.load(r).get("data", [])
print(games)

games is a list of dicts that stores all the info you need. CS stats are stored in each list element under the following keys: games[0]["myData"]["stats"]["minion_kill"]
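To illustrate the key path above, here is a sketch that pulls the CS number out of every game; the games list is a made-up fragment mirroring the described structure (real API responses contain many more fields per game):

```python
# Made-up sample mirroring the structure described above.
games = [
    {"myData": {"stats": {"minion_kill": 180}}},
    {"myData": {"stats": {"minion_kill": 225}}},
]

# Follow the stated key path for every game in the list.
cs_per_game = [g["myData"]["stats"]["minion_kill"] for g in games]
print(cs_per_game)  # -> [180, 225]
```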

The only difficult thing here is finding out how to get the summoner_id for the desired user (which is 4b4tvMrpRRDLvXAiQ_Vmh5yMOsD0R3GPGTUVfIanp1Httg in your example).
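One way to find it (a sketch, assuming the profile page still embeds a "summoner_id" field in inline JSON, which is what the regex in the other answer relies on) is to search the page HTML; here it runs against a hard-coded sample string so it works offline:

```python
import re

# Hard-coded sample standing in for the downloaded profile page HTML;
# the real page embeds the id inside a JSON blob in the document.
html_doc = 'foo "summoner_id":"4b4tvMrpRRDLvXAiQ_Vmh5yMOsD0R3GPGTUVfIanp1Httg" bar'

match = re.search(r'"summoner_id":"(.*?)"', html_doc)
if match:
    print(match.group(1))  # -> 4b4tvMrpRRDLvXAiQ_Vmh5yMOsD0R3GPGTUVfIanp1Httg
```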

pL3b
  • hello man, I used your code but it didn't work; I put the error in the question above. P.S.: I installed playwright in VS Code and Chromium in the cmd console because I couldn't do it in VS Code – Benjamin Correa Mar 09 '23 at 20:53
  • I can see .ipynb in your path, so you are using Jupyter notebooks. Have a look at this answer to make the playwright code async: https://stackoverflow.com/a/71702599/17200418 – pL3b Mar 09 '23 at 22:37
  • well, I can use .py too hahaha. It works, man!! Thanks! I have a lot to learn; I hope to get what I want from the page by modifying this code – Benjamin Correa Mar 09 '23 at 23:20

You can use this example of how to load the data from the external URL and compute the CS value:

import re
import requests


url = "https://www.op.gg/summoners/kr/Hide%20on%20bush"
api_url = "https://op.gg/api/v1.0/internal/bypass/games/kr/summoners/{summoner_id}?=&limit=20&hl=en_US&game_type=total"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0"
}

html_doc = requests.get(url, headers=headers).text
summoner_id = re.search(r'"summoner_id":"(.*?)"', html_doc).group(1)

data = requests.get(api_url.format(summoner_id=summoner_id), headers=headers).json()

for d in data["data"]:
    stats = d["myData"]["stats"]
    kills = (
        stats["minion_kill"]
        + stats["neutral_minion_kill_team_jungle"]
        + stats["neutral_minion_kill_enemy_jungle"]
        + stats["neutral_minion_kill"]
    )
    cs = kills / (d['game_length_second'] / 60)
    print(f'{cs=:.1f}')

Prints:

cs=6.7
cs=8.5
cs=8.2
cs=1.4
cs=7.3
cs=8.5
cs=6.8
cs=7.7
cs=8.7
cs=8.8
cs=5.6
cs=9.9
cs=7.0
cs=9.6
cs=9.7
cs=5.0
cs=7.5
cs=9.2
cs=9.0
cs=7.9
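The per-minute arithmetic above (lane minions plus the three jungle-minion counters, divided by game length in minutes) can be factored into a small standalone helper; the sample stats dict below is made up for illustration:

```python
def cs_per_minute(stats: dict, game_length_second: int) -> float:
    """Total creep score (lane + jungle minions) per minute of game time."""
    kills = (
        stats["minion_kill"]
        + stats["neutral_minion_kill_team_jungle"]
        + stats["neutral_minion_kill_enemy_jungle"]
        + stats["neutral_minion_kill"]
    )
    return kills / (game_length_second / 60)

# Made-up sample: 200 total CS over a 25-minute (1500 s) game -> 8.0 CS/min.
sample = {
    "minion_kill": 170,
    "neutral_minion_kill_team_jungle": 10,
    "neutral_minion_kill_enemy_jungle": 4,
    "neutral_minion_kill": 16,
}
print(f"{cs_per_minute(sample, 1500):.1f}")  # -> 8.0
```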
Andrej Kesely