I'm trying to scrape data from https://www.bps.go.id/indicator/3/1/25/inflasi-umum-.html covering 1995 to the present. Is there any way I can do it? I'm stuck because each year has its own table and a different HTML page. Thank you in advance.

Sajan

adinda aulia
-
The `lxml` and `beautifulsoup` libraries might be useful. – Sajan Oct 01 '20 at 09:40
-
Can I do it in one go, or must I scrape every page separately because each date has different HTML? – adinda aulia Oct 01 '20 at 13:47
-
I think once for each page should be sufficient. – Sajan Oct 02 '20 at 09:18
1 Answer
First, extract the `option` values from the `select` tag to get the URL for each year:
import requests
from bs4 import BeautifulSoup
import pandas as pd
baseUrl = "https://www.bps.go.id"
dateFrom = 1995
dateTo = 2019
# Fetch the indicator page and read the year selector
r = requests.get(f"{baseUrl}/indicator/3/1/25/inflasi-umum-.html")
soup = BeautifulSoup(r.text, "html.parser")
# Map each year label to the relative URL stored in the option's value
years = {
    t.text.strip(): t["value"]
    for t in soup.find("select").find_all("option")
    if t.get("value")
}
Then iterate through your range of years and use pandas to extract the table, so that you end up with a dictionary with the year as key and a DataFrame as value:
# Iterate through the years and keep one DataFrame per year
data = {}
ranges = range(dateFrom, dateTo + 1)
for n in ranges:
    print(f"get data for year {n}")
    r = requests.get(f"{baseUrl}{years[str(n)]}")
    # read_html returns every table on the page; index 2 is the data table
    table = pd.read_html(r.text)
    data[str(n)] = table[2]
print(data)
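If the goal is a single CSV file, the dictionary of DataFrames can be combined with `pd.concat`, which accepts a dict and uses its keys as an outer index level. The snippet below is only a sketch: the two small DataFrames, their column names (`Bulan`, `Inflasi`), their values, and the output file name `inflasi.csv` are made up for illustration and stand in for the real `data` built by the loop. The real tables may have different columns from year to year, in which case `pd.concat` fills the gaps with NaN.

```python
import pandas as pd

# Stand-in for the `data` dict built by the scraping loop: year -> DataFrame.
# Column names and values are invented for illustration only.
data = {
    "2018": pd.DataFrame({"Bulan": ["Januari", "Februari"], "Inflasi": [0.62, 0.17]}),
    "2019": pd.DataFrame({"Bulan": ["Januari", "Februari"], "Inflasi": [0.32, -0.08]}),
}

# pd.concat accepts a dict; its keys become the outer index level ("year").
combined = pd.concat(data, names=["year", "row"])

# Turn the year into a regular column and discard the per-table row counter.
combined = combined.reset_index(level="year").reset_index(drop=True)

# Write everything to one CSV file (hypothetical file name).
combined.to_csv("inflasi.csv", index=False)
```

Keeping the year as a column means the CSV stays flat, so it can be filtered or pivoted by year later without rebuilding the dictionary.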

Bertrand Martel
-
Thank you so much! Can I ask why `years` has to be a dict? Can I change it to something else? – adinda aulia Oct 02 '20 at 01:43
-
In the code above, `years` looks like `{ "2019": "/......." }`, so you can use `years["2019"]` in the loop to get the path value directly, but you can also make an array if you want. – Bertrand Martel Oct 02 '20 at 01:47
-
Okay, thanks! One more question: do you have an idea how to delete the very first output? I want to convert it to CSV. Sorry to bother you. => [ 0 1 0 DATA SENSUS NaN, 0 1 2 3 0 NaN Facebook NaN Instagram 1 NaN Twitter NaN Youtube – adinda aulia Oct 02 '20 at 02:03
-
I've updated the code above with `table[2]` to get only the table without the header. – Bertrand Martel Oct 02 '20 at 02:09
-
Umm, sorry, can I ask one more question? I've been trying to convert the dict to a DataFrame, but it always raises an error, and the same goes for saving to CSV. Can you help me once again? – adinda aulia Oct 02 '20 at 04:26