0

im having this issue - i need to scrape a dynamic table's data from this webpage - https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb

This webpage uses ajax to generate the table that I want to fetch. I have inspected the element and it seems to be straightforward, I have the request url with param, I try to send a request, get response code 200 and the response is empty.

I must be doing something wrong, but im not sure how to fetch this data in python even though it seems kind of straightforward, could anyone help me out?

I want to get the same table as the one that is displayed on the website.

baduker
  • 19,152
  • 9
  • 33
  • 56
  • 1
    Show us your coding efforts. What have you tried until now? – Barry the Platipus Aug 10 '23 at 11:10
  • Well, thanks for comments, but what code shall I pass, there are 3 lines of code only: url = "" # here goes url from console > network > XHR > header section, which is to long to pass to the comment response = requests.post(url) print(response.content) Moreover, there is similar thread here - https://stackoverflow.com/questions/16390257/scraping-ajax-pages-using-python so Im not sure what is wrong with my topic / what shall I clarify. I also tried with scrapy but the response was also empty. – Nice Guy Eddie Aug 10 '23 at 12:22

1 Answers1

3

Actually, this page turned out to be a pretty cool challenge!

Breakdown:

  • The link to the report sits in the source HTML, but the table is rendered dynamically by JavaScript but you can easily scoop it out
  • The safeargs_data value
5f5f7265706f72743d504c5f5553455f524242265f5f63616c6c547970653d7026646174613d323032332d30382d3130265f737667737570706f72743d74727565267265736f7572636549443d72656e646572696e6755524c265f5f706167654e756d6265723d31265f5f626174636849443d31383964663635613634642d31

is just a silly way of obfuscating in hex this value

__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1
  • I've decoded it for ease of readability and editing e.g. the data key
  • Finally, I use the table_link, payload data, and updated headers to make a POST request.
  • Then, it's easy to get the table out of the JSON and parse it with pandas

By the way, if you convert the hex value from the URL and add the safeargs_data to it, you'll still get your report.

Here's a full, decoded URL:

https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb?p_auth=2XVP5Wtz&p_p_id=VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt___action=processEdit&__action=processEdit__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1

Here's my take on it:

import binascii
from urllib.parse import urlencode

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = (
    "https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/"
    "raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb"
)

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200",
}

with requests.Session() as session:
    table_link = (
        BeautifulSoup(session.get(url, headers=headers).content, "lxml")
        .select_one("a[class='vui-generic-url']")
        .get("href")
    )

    headers.update({"X-Requested-With": "XMLHttpRequest"})

    payload_data = {
        "__report": "PL_USE_RBB",
        "__callType": "p",
        "data": "2023-08-10",
        "_svgsupport": "true",
        "resourceID": "renderingURL",
        "__pageNumber": "1",
        "__batchID": "189df65a64d-1",
    }

    hex_it = binascii.hexlify(urlencode(payload_data).encode()).decode()

    table_data = session.post(
        table_link,
        data={"safeargs_data": hex_it},
        headers=headers,
    )

    df = pd.read_html(
        # .replace() is used to get rid of NBSPs
        table_data.json()["reportContent"].replace("\xa0", ""),
        flavor="lxml",
        skiprows=[0],
    )[1]
    df.dropna(how="all", inplace=True)
    df.to_csv("PL_USE_RBB.csv", index=False)

    print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))

This should save a .csv file PL_USE_RBB.csv and then print this:

+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
|   ('Numer iteracji', '24') | ('Początek', '2023-08-10 15:15:34')   | ('Koniec', '2023-08-10 15:16:06')   |   ('Początek', '17') |   ('Koniec', '24') |   ('[MWh]', '47561,000') |
|----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------|
|                         23 | 2023-08-10 14:15:42                   | 2023-08-10 14:16:16                 |                   16 |                 24 |              4.88788e+07 |
|                         22 | 2023-08-10 13:15:39                   | 2023-08-10 13:16:24                 |                   15 |                 24 |              4.50884e+07 |
|                         21 | 2023-08-10 12:15:36                   | 2023-08-10 12:16:10                 |                   14 |                 24 |              4.09294e+07 |
|                         20 | 2023-08-10 11:15:33                   | 2023-08-10 11:16:15                 |                   13 |                 24 |              3.12136e+07 |
|                         19 | 2023-08-10 10:15:41                   | 2023-08-10 10:16:07                 |                   12 |                 24 |              2.55946e+07 |
|                         18 | 2023-08-10 09:15:40                   | 2023-08-10 09:16:05                 |                   11 |                 24 |              2.26086e+07 |
|                         17 | 2023-08-10 08:15:40                   | 2023-08-10 08:16:00                 |                   10 |                 24 |              1.58324e+07 |
|                         16 | 2023-08-10 07:15:35                   | 2023-08-10 07:15:56                 |                    9 |                 24 |              1.11414e+07 |
|                         15 | 2023-08-10 06:15:33                   | 2023-08-10 06:15:52                 |                    8 |                 24 |              1.11796e+07 |
|                         14 | 2023-08-10 05:15:32                   | 2023-08-10 05:15:52                 |                    7 |                 24 |              9.639e+06   |
|                         13 | 2023-08-10 04:15:41                   | 2023-08-10 04:16:11                 |                    6 |                 24 |              9.0502e+06  |
|                         12 | 2023-08-10 03:15:36                   | 2023-08-10 03:15:55                 |                    5 |                 24 |              7.871e+06   |
|                         11 | 2023-08-10 02:15:35                   | 2023-08-10 02:16:03                 |                    4 |                 24 |              8.395e+06   |
|                         10 | 2023-08-10 01:15:41                   | 2023-08-10 01:16:04                 |                    3 |                 24 |              7.8954e+06  |
|                          9 | 2023-08-10 00:15:37                   | 2023-08-10 00:15:55                 |                    2 |                 24 |              8.2582e+06  |
|                          8 | 2023-08-09 23:15:03                   | 2023-08-09 23:15:24                 |                    1 |                 24 |              6.6784e+06  |
|                          7 | 2023-08-09 22:15:08                   | 2023-08-09 22:15:16                 |                    1 |                 24 |         603200           |
|                          6 | 2023-08-09 21:15:12                   | 2023-08-09 21:15:22                 |                    1 |                 24 |              0           |
|                          5 | 2023-08-09 20:15:06                   | 2023-08-09 20:15:12                 |                    1 |                 24 |              0           |
|                          4 | 2023-08-09 19:15:04                   | 2023-08-09 19:15:14                 |                    1 |                 24 |              0           |
|                          3 | 2023-08-09 18:15:11                   | 2023-08-09 18:15:32                 |                    1 |                 24 |              0           |
|                          2 | 2023-08-09 17:15:11                   | 2023-08-09 17:15:22                 |                    1 |                 24 |              0           |
|                          1 | 2023-08-09 16:15:09                   | 2023-08-09 16:15:31                 |                    1 |                 24 |              0           |
+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
baduker
  • 19,152
  • 9
  • 33
  • 56
  • Sorry for very junior question, but my struggle was what to put in the requests.get as payload and headers as in the headers tab there were plenty of attributes. How did You figure out that You need to pass only "User-Agent"? Instead of that link that Yoou use, I was using the one with safeargs from payload tab - that seems to be mistake as well. Thanks thoigh, I need to figure out what is happening in the code though :) – Nice Guy Eddie Aug 10 '23 at 12:43