Scraping data from each table based on dates

Question

I'm trying to scrape data from https://www.bps.go.id/indicator/3/1/25/inflasi-umum-.html for every year from 1995 to the present. Is there any way I can do it? I'm stuck because each year has its own table and different HTML. Thank you in advance.

The `lxml` and `beautifulsoup` libraries might be useful. — Sajan, Oct 01 '20 at 09:40
Can I do it in one go, or must I scrape every page separately because each date has different HTML? — adinda aulia, Oct 01 '20 at 13:47

Bertrand Martel · Accepted Answer · 2020-10-02T03:18:58.587

First, extract the option values from the select tag to get the URL path for each year:

import requests
from bs4 import BeautifulSoup
import pandas as pd

baseUrl = "https://www.bps.go.id"
dateFrom = 1995
dateTo = 2019

# Fetch the indicator page and map each year label to its option value (a URL path)
r = requests.get(f"{baseUrl}/indicator/3/1/25/inflasi-umum-.html")
soup = BeautifulSoup(r.text, "html.parser")
years = {
    t.text: t["value"]
    for t in soup.find("select").find_all("option")
    if t.get("value")  # skip options that have no value
}

Then iterate over your range of years and use pandas to extract the table, so that you end up with a dictionary with the year as key and a DataFrame as value:

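# Iterate through the years
data = {}
for n in range(dateFrom, dateTo + 1):
    print(f"get data for year {n}")
    # years maps the year label to the path of that year's page
    r = requests.get(f"{baseUrl}{years[str(n)]}")
    tables = pd.read_html(r.text)
    # Index 2 is the data table; the tables before it are page header content
    data[str(n)] = tables[2]

print(data)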

Try this on repl.it

edited Oct 02 '20 at 03:18

answered Oct 01 '20 at 14:38


Thank you so much! Can I ask why years has to be a dict? Can I change it to something else? – adinda aulia Oct 02 '20 at 01:43
In the code above, years looks like `{ "2019": "/......." }`, so you can use years["2019"] in the loop to get the path value directly, but you can also use an array if you want – Bertrand Martel Oct 02 '20 at 01:47
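The difference between the two options in the reply above can be sketched as follows. The paths here are hypothetical placeholders, not the real option values from the site.

```python
# Hypothetical (year, path) pairs; the real paths come from the page's <select> tag.
options = [("2018", "/path-for-2018"), ("2019", "/path-for-2019")]

# Dict: look up the path for a given year directly.
years = dict(options)
path_2019 = years["2019"]

# List of pairs: no direct lookup, so iterate over every (year, path) pair instead.
for year, path in options:
    print(year, path)
```

With a list you would loop over the pairs themselves rather than over a numeric range of years, which also avoids a KeyError if a year is missing from the page.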
Okay, thanks! One more question: do you have an idea how to delete the very first output? I want to convert it to CSV. Sorry to bother you => [ 0 1 0 DATA SENSUS NaN, 0 1 2 3 0 NaN Facebook NaN Instagram 1 NaN Twitter NaN Youtube – adinda aulia Oct 02 '20 at 02:03

I've updated the code above with table[2] to get only the table without the header – Bertrand Martel Oct 02 '20 at 02:09
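A hard-coded index such as table[2] depends on the page layout staying the same. One possible alternative, sketched here under the assumption that the data table is the largest table on the page, is to pick the DataFrame with the most cells. The helper name `pick_largest_table` and the sample frames are made up for illustration; they are not taken from the site.

```python
import pandas as pd

def pick_largest_table(tables):
    """Return the DataFrame with the most cells from a list of DataFrames."""
    return max(tables, key=lambda df: df.size)

# Made-up stand-ins for what pd.read_html might return:
# two small header tables followed by the data table.
tables = [
    pd.DataFrame([["DATA SENSUS", None]]),
    pd.DataFrame([[None, "Facebook"], [None, "Twitter"]]),
    pd.DataFrame({"Month": ["Jan", "Feb", "Mar"], "Value": [0.1, 0.2, 0.3]}),
]
data_table = pick_largest_table(tables)
```

This is only a heuristic; check it against a few years before relying on it.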
Sorry, can I ask one more question? I've been trying to convert the dict to a DataFrame, but it always raises an error, and the same goes for saving to CSV. Can you help me once again? – adinda aulia Oct 02 '20 at 04:26

See [this](https://stackoverflow.com/q/18837262/2614364) – Bertrand Martel Oct 02 '20 at 10:39
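For a dict whose values are DataFrames, as `data` is in the answer, one approach that may work is `pd.concat`, which accepts a dict and uses its keys as an outer index level. This is a minimal sketch with made-up values standing in for the scraped tables; the file name `inflation.csv` and the column names are assumptions.

```python
import pandas as pd

# Stand-in for the `data` dict built by the loop: year -> DataFrame (made-up values).
data = {
    "2018": pd.DataFrame({"Month": ["Jan", "Feb"], "Value": [0.1, 0.2]}),
    "2019": pd.DataFrame({"Month": ["Jan", "Feb"], "Value": [0.3, 0.4]}),
}

# The dict keys become the outer index level, named "year" here.
combined = pd.concat(data, names=["year", "row"])

# Turn the year into an ordinary column and drop the per-table row numbers.
combined = combined.reset_index(level="year").reset_index(drop=True)

# Hypothetical output file name.
combined.to_csv("inflation.csv", index=False)
```

If the tables for different years do not share the same column names, `pd.concat` aligns them by name and fills the gaps with NaN, so the headers may need cleaning first.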
