
I want to import the data in the tables using Python.

The website in question is : https://hydro.eaufrance.fr/sitehydro/F7000001/series

You choose the values and dates you want, click "OK", and it directs you to another page with the graph and multiple data tables, but the URL remains unchanged.

There are multiple tutorials on how to extract HTML tables and scrape the web. I tried them, but I believe this case requires further treatment (the URL doesn't change, and there are multiple pages), perhaps with Selenium. That's just my thinking, based on what I know of how Python handles websites.

I tried with requests, pandas and bs4, but the best I got was the source code of the page.

My exact aim is the following: to create a function that takes only the code of the hydraulic station (you'll find it if you visit the website) and returns the entire series of "maximum daily flow" for that station, from its creation until the present. Other processing will be applied afterwards.

Note that, in this case, F700 0001 is the code of the station, located in Paris.

I would like to import the data directly into Python in CSV or TXT format, so that I can manipulate it and extract the information I want.

In the end, I believe this one is rather complex, and I would be grateful if anyone proves the contrary.

  • Probably dynamic content loaded by Javascript. The answers [here](https://stackoverflow.com/q/8049520/5320906) may help you. – snakecharmerb Aug 25 '23 at 16:24

2 Answers


The data you see is loaded with JavaScript from an external URL. Here is an example of how you can make that request and load the data into a pandas DataFrame:

import pandas as pd
import requests

api_url = "https://hydro.eaufrance.fr/sitehydro/ajax/F7000001/series"

# change these parameters to suit your needs:
params = {
    "hydro_series[startAt]": "01/08/2023",
    "hydro_series[endAt]": "05/08/2023",
    "hydro_series[variableType]": "simple_and_interpolated_and_hourly_variable",
    "hydro_series[simpleAndInterpolatedAndHourlyVariable]": "Q",
    "hydro_series[statusData]": "most_valid",
}

df = pd.DataFrame([requests.get(api_url, params=params).json()])

df = df[["timezone", "unitH", "unitQ", "unitRR", "watershedRainsTooMuchData", "series"]]
df = pd.concat([df, df.pop("series").apply(pd.Series)], axis=1)
df = df.explode("data")
df = pd.concat([df, df.pop("data").apply(pd.Series)], axis=1)

print(df)

Prints:

  timezone unitH unitQ unitRR  watershedRainsTooMuchData      code metric timeStep    statuses  rolling unit                                                                                                                                  title baseFilename       v                     t    md   s   q   m  c
0      UTC     m     l     mm                      False  F7000001      Q     None  most_valid    False    l  Débit instantané - Données les plus valides de l'entité - F700 0001 - La Seine à Paris - du 01/08/2023 00:00 au 05/08/2023 23:59 (TU)   F7000001_Q  149000  2023-08-01T00:00:00Z  None  12  20  10  0
0      UTC     m     l     mm                      False  F7000001      Q     None  most_valid    False    l  Débit instantané - Données les plus valides de l'entité - F700 0001 - La Seine à Paris - du 01/08/2023 00:00 au 05/08/2023 23:59 (TU)   F7000001_Q  154000  2023-08-01T01:00:00Z  None  12  20  10  0
0      UTC     m     l     mm                      False  F7000001      Q     None  most_valid    False    l  Débit instantané - Données les plus valides de l'entité - F700 0001 - La Seine à Paris - du 01/08/2023 00:00 au 05/08/2023 23:59 (TU)   F7000001_Q  151000  2023-08-01T01:10:00Z  None  12  20  10  0
0      UTC     m     l     mm                      False  F7000001      Q     None  most_valid    False    l  Débit instantané - Données les plus valides de l'entité - F700 0001 - La Seine à Paris - du 01/08/2023 00:00 au 05/08/2023 23:59 (TU)   F7000001_Q  151000  2023-08-01T01:20:00Z  None  12  20  10  0

...
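Since the goal is to get the data into a CSV file, the flattening steps above can end with a `to_csv` call. A minimal sketch, using a hand-made payload in the same shape as the API response shown above (the field names come from the printed output; the values are illustrative, not real measurements):

```python
import pandas as pd

# Illustrative sample mimicking the JSON structure returned by the API;
# a real script would use requests.get(api_url, params=params).json() instead.
payload = {
    "timezone": "UTC",
    "unitQ": "l",
    "series": {
        "code": "F7000001",
        "metric": "Q",
        "data": [
            {"v": 149000, "t": "2023-08-01T00:00:00Z"},
            {"v": 154000, "t": "2023-08-01T01:00:00Z"},
        ],
    },
}

df = pd.DataFrame([payload])
# Flatten the nested "series" dict, then one row per "data" entry.
df = pd.concat([df, df.pop("series").apply(pd.Series)], axis=1)
df = df.explode("data", ignore_index=True)
df = pd.concat([df, df.pop("data").apply(pd.Series)], axis=1)

# Keep just the timestamp and value columns and write them out.
df[["t", "v"]].to_csv("flow.csv", index=False)
```

From there the CSV can be reloaded and processed however you like.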
Andrej Kesely

Expanding on @Andrej's answer, here's how you can solve this kind of problem on your own.

  1. Open the page in your browser
  2. Inspect page (right click > inspect)
  3. Go to the Network tab
  4. Enter your choices into form and submit
  5. Find the new search request in the requests log

To get more details on which parameters you're allowed to send and their acceptable values, I recommend looking into the page's HTML or JavaScript; they're probably defined there.

Once you figure out the exact options you want, you can write a Python function that swaps out the station code and makes the request itself.
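One possible shape for such a function, reusing the endpoint and form fields from the answer above. This is only a sketch: `fetch_series` is a name I made up, the date-range handling is left to you, and whether the API accepts arbitrary ranges is untested.

```python
import requests

API_URL = "https://hydro.eaufrance.fr/sitehydro/ajax/{code}/series"

def fetch_series(station_code, start, end):
    """Fetch the flow series for one station as JSON.

    The parameter names mirror the form fields observed in the
    browser's Network tab; dates are in DD/MM/YYYY format.
    """
    params = {
        "hydro_series[startAt]": start,
        "hydro_series[endAt]": end,
        "hydro_series[variableType]": "simple_and_interpolated_and_hourly_variable",
        "hydro_series[simpleAndInterpolatedAndHourlyVariable]": "Q",
        "hydro_series[statusData]": "most_valid",
    }
    response = requests.get(API_URL.format(code=station_code), params=params, timeout=30)
    response.raise_for_status()  # fail loudly on a bad station code
    return response.json()
```

For example, `fetch_series("F7000001", "01/08/2023", "05/08/2023")` should return the same JSON the other answer feeds into pandas.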

Táwros