Actually, this page turned out to be a pretty cool challenge!
Breakdown:
- The link to the report sits in the source HTML, but the table is rendered dynamically by JavaScript but you can easily scoop it out
- The
safeargs_data
value
5f5f7265706f72743d504c5f5553455f524242265f5f63616c6c547970653d7026646174613d323032332d30382d3130265f737667737570706f72743d74727565267265736f7572636549443d72656e646572696e6755524c265f5f706167654e756d6265723d31265f5f626174636849443d31383964663635613634642d31
is just a silly way of obfuscating in hex this value
__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1
- I've decoded it for ease of readability and editing e.g. the
data
key
- Finally, I use the
table_link
, payload
data, and updated headers
to make a POST
request.
- Then, it's easy to get the table out of the
JSON
and parse it with pandas
By the way, if you convert the hex value from the URL and add the safeargs_data
to it, you'll still get your report.
Here's a full, decoded URL:
https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb?p_auth=2XVP5Wtz&p_p_id=VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt___action=processEdit&__action=processEdit__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1
Here's my take on it:
import binascii
from urllib.parse import urlencode
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
url = (
"https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/"
"raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb"
)
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200",
}
with requests.Session() as session:
table_link = (
BeautifulSoup(session.get(url, headers=headers).content, "lxml")
.select_one("a[class='vui-generic-url']")
.get("href")
)
headers.update({"X-Requested-With": "XMLHttpRequest"})
payload_data = {
"__report": "PL_USE_RBB",
"__callType": "p",
"data": "2023-08-10",
"_svgsupport": "true",
"resourceID": "renderingURL",
"__pageNumber": "1",
"__batchID": "189df65a64d-1",
}
hex_it = binascii.hexlify(urlencode(payload_data).encode()).decode()
table_data = session.post(
table_link,
data={"safeargs_data": hex_it},
headers=headers,
)
df = pd.read_html(
# .replace() is used to get rid of NBSPs
table_data.json()["reportContent"].replace("\xa0", ""),
flavor="lxml",
skiprows=[0],
)[1]
df.dropna(how="all", inplace=True)
df.to_csv("PL_USE_RBB.csv", index=False)
print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))
This should save a .csv
file PL_USE_RBB.csv
and then print this:
+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
| ('Numer iteracji', '24') | ('Początek', '2023-08-10 15:15:34') | ('Koniec', '2023-08-10 15:16:06') | ('Początek', '17') | ('Koniec', '24') | ('[MWh]', '47561,000') |
|----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------|
| 23 | 2023-08-10 14:15:42 | 2023-08-10 14:16:16 | 16 | 24 | 4.88788e+07 |
| 22 | 2023-08-10 13:15:39 | 2023-08-10 13:16:24 | 15 | 24 | 4.50884e+07 |
| 21 | 2023-08-10 12:15:36 | 2023-08-10 12:16:10 | 14 | 24 | 4.09294e+07 |
| 20 | 2023-08-10 11:15:33 | 2023-08-10 11:16:15 | 13 | 24 | 3.12136e+07 |
| 19 | 2023-08-10 10:15:41 | 2023-08-10 10:16:07 | 12 | 24 | 2.55946e+07 |
| 18 | 2023-08-10 09:15:40 | 2023-08-10 09:16:05 | 11 | 24 | 2.26086e+07 |
| 17 | 2023-08-10 08:15:40 | 2023-08-10 08:16:00 | 10 | 24 | 1.58324e+07 |
| 16 | 2023-08-10 07:15:35 | 2023-08-10 07:15:56 | 9 | 24 | 1.11414e+07 |
| 15 | 2023-08-10 06:15:33 | 2023-08-10 06:15:52 | 8 | 24 | 1.11796e+07 |
| 14 | 2023-08-10 05:15:32 | 2023-08-10 05:15:52 | 7 | 24 | 9.639e+06 |
| 13 | 2023-08-10 04:15:41 | 2023-08-10 04:16:11 | 6 | 24 | 9.0502e+06 |
| 12 | 2023-08-10 03:15:36 | 2023-08-10 03:15:55 | 5 | 24 | 7.871e+06 |
| 11 | 2023-08-10 02:15:35 | 2023-08-10 02:16:03 | 4 | 24 | 8.395e+06 |
| 10 | 2023-08-10 01:15:41 | 2023-08-10 01:16:04 | 3 | 24 | 7.8954e+06 |
| 9 | 2023-08-10 00:15:37 | 2023-08-10 00:15:55 | 2 | 24 | 8.2582e+06 |
| 8 | 2023-08-09 23:15:03 | 2023-08-09 23:15:24 | 1 | 24 | 6.6784e+06 |
| 7 | 2023-08-09 22:15:08 | 2023-08-09 22:15:16 | 1 | 24 | 603200 |
| 6 | 2023-08-09 21:15:12 | 2023-08-09 21:15:22 | 1 | 24 | 0 |
| 5 | 2023-08-09 20:15:06 | 2023-08-09 20:15:12 | 1 | 24 | 0 |
| 4 | 2023-08-09 19:15:04 | 2023-08-09 19:15:14 | 1 | 24 | 0 |
| 3 | 2023-08-09 18:15:11 | 2023-08-09 18:15:32 | 1 | 24 | 0 |
| 2 | 2023-08-09 17:15:11 | 2023-08-09 17:15:22 | 1 | 24 | 0 |
| 1 | 2023-08-09 16:15:09 | 2023-08-09 16:15:31 | 1 | 24 | 0 |
+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+