
I am trying to scrape:

https://id.investing.com/commodities/gold-historical-data 

table from 2010-2020, but the problem is that the URL stays the same no matter which dates I choose. How can I tell Python to scrape the data for 2010-2020? Please help me, I'm using Python 3.

This is my code:

import requests, bs4

url = 'https://id.investing.com/commodities/gold-historical-data'
headers = {"User-Agent":"Mozilla/5.0"}
response = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(response.text, 'lxml')
tables = soup.find_all('table')

print(soup)

with open('emasfile.csv','w') as csv:
    for row in tables[1].find_all('tr'):
        line = ""
        for td in row.find_all(['td', 'th']):
            line += '"' + td.text + '",'
        csv.write(line + '\n')
adinda aulia
  • first check how the web browser does it - use `DevTools` (tab `Network`) in Chrome/Firefox to see all requests the browser sends to the server when you change the date. Maybe it uses extra data in the URL. OR it uses a POST request with extra data. OR it uses JavaScript with AJAX to send a request with extra data – furas Sep 11 '20 at 03:57
  • Most likely you need to use `requests.post` instead of `requests.get`. These links should be helpful: https://stackoverflow.com/questions/16390257/scraping-ajax-pages-using-python#16395938 and https://stackoverflow.com/questions/53890493/scraping-data-from-investing-com-for-btc-eth-using-beautifulsoup – footfalcon Sep 11 '20 at 04:02

1 Answer


This page uses JavaScript with AJAX to get data from

https://id.investing.com/instruments/HistoricalDataAjax

It sends a POST request with extra data, including the start and end dates (`st_date`, `end_date`).

You could try 01/01/2010 and 12/31/2020 in a single request, but I used a for-loop to get every year separately.

I got all this information from DevTools (tab 'Network') in Chrome/Firefox.

import requests
from bs4 import BeautifulSoup
import csv

url = 'https://id.investing.com/instruments/HistoricalDataAjax'

payload = {
    "curr_id": "8830",
    "smlID": "300004",
    "header": "Data+Historis+Emas+Berjangka",
    "st_date": "01/30/2020",
    "end_date": "12/31/2020",
    "interval_sec": "Daily",
    "sort_col": "date",
    "sort_ord": "DESC",
    "action":"historical_data"
}

headers = {
    #"Referer": "https://id.investing.com/commodities/gold-historical-data",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0",
    "X-Requested-With": "XMLHttpRequest"
}

fh = open('output.csv', 'w')
csv_writer = csv.writer(fh)

for year in range(2010, 2021):
    print('year:', year)
    
    payload["st_date"] = f"01/01/{year}"
    payload["end_date"] = f"12/31/{year}"
    
    r = requests.post(url, data=payload, headers=headers)
    #print(r.text)
    
    soup = BeautifulSoup(r.text, 'lxml')
    table = soup.find('table')
    for row in table.find_all('tr')[1:]: # [1:] to skip header
        row_data = [item.text for item in row.find_all('td')]
        print(row_data)
        csv_writer.writerow(row_data)
        
fh.close()
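
Since `id.investing.com` serves numbers in Indonesian locale format (thousands separated by `.`, decimals by `,`), the values in the CSV are strings like `"1.234,56"`. If you want numeric values (e.g. before loading into MySQL), a small helper can convert them. This is a sketch I'm adding on the assumption that the table uses that `"1.234,56"` style; adjust if the site formats differently:

```python
def id_number_to_float(text):
    """Convert an Indonesian-formatted number like '1.234,56' to 1234.56.

    Drops the thousands separator '.' and turns the decimal ',' into '.'.
    """
    return float(text.replace('.', '').replace(',', '.'))

# Example conversions:
print(id_number_to_float('1.234,56'))  # 1234.56
print(id_number_to_float('987,5'))     # 987.5
```

You could call `id_number_to_float` on the price columns of `row_data` before `csv_writer.writerow`, keeping the date column as text.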
furas
  • thank you so much! Can I ask where to find `"smlID": "300004"`? I can't find it in the DevTools Network tab. – adinda aulia Sep 17 '20 at 04:42
  • in Firefox, in the `Network` tab, when you click on the request to https://id.investing.com/instruments/HistoricalDataAjax it shows details on the right side - headers, cookies, request data, response data, etc. In the request data I can see all the values which I used in `payload`. Chrome should also display these details when you click the request in DevTools. – furas Sep 17 '20 at 07:39
  • if you check the HTML then you can also find `` and it can be more useful, because the page may use a different `smlID` for different users. – furas Sep 17 '20 at 07:44
  • ok I got it! Thank you. One more question: can I get rid of the blank line between rows? When I import the file into MySQL it makes those rows null. I'm sorry for asking so many questions... – adinda aulia Sep 17 '20 at 08:44
  • I don't get that problem on Linux, but on Windows you may need `open(..., newline='')`. See [CSV in Python adding an extra carriage return, on Windows](https://stackoverflow.com/questions/3191528/csv-in-python-adding-an-extra-carriage-return-on-windows) – furas Sep 17 '20 at 08:51