Why does BeautifulSoup fail to extract data from websites to csv?

Question

User Chrisvdberge helped me create the following code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url_DAX = 'https://www.eurexchange.com/exchange-en/market-data/statistics/market-statistics-online/100!onlineStats?viewType=4&productGroupId=13394&productId=34642&cp=&month=&year=&busDate=20191114'
req = requests.get(url_DAX, verify=False)
html = req.text
soup = BeautifulSoup(html, 'lxml')
df = pd.read_html(str(html))[0]
df.to_csv('results_DAX.csv')
print(df)

url_DOW = 'https://www.cmegroup.com/trading/equity-index/us-index/e-mini-dow_quotes_settlements_futures.html'
req = requests.get(url_DOW, verify=False)
html = req.text
soup = BeautifulSoup(html, 'lxml')
df = pd.read_html(str(html))[0]
df.to_csv('results_DOW.csv')
print(df)

url_NASDAQ = 'https://www.cmegroup.com/trading/equity-index/us-index/e-mini-nasdaq-100_quotes_settlements_futures.html'
req = requests.get(url_NASDAQ, verify=False)
html = req.text
soup = BeautifulSoup(html, 'lxml')
df = pd.read_html(str(html))[0]
df.to_csv('results_NASDAQ.csv')
print(df)

url_CAC = 'https://live.euronext.com/fr/product/index-futures/FCE-DPAR/settlement-prices'
req = requests.get(url_CAC, verify=False)
html = req.text
soup = BeautifulSoup(html, 'lxml')
df = pd.read_html(str(html))[0]
df.to_csv('results_CAC.csv')
print(df)

This gives the following result:

Only 3 .csv files are created: results_DAX.csv (everything is fine here, it contains the values I want), plus results_DOW.csv and results_NASDAQ.csv (these two don't contain the values I want, and I don't understand why).
As you can see in the code, 4 files should be created, not just 3.

So my questions are:

How can I get 4 csv files?
How can I get the values into results_DOW.csv and results_NASDAQ.csv (and maybe also into results_CAC.csv)?

Thank you for your answers! :)

For `https://live.euronext.com/fr/product/index-futures/FCE-DPAR/settlement-prices`, there are no `<table>` tags in the HTML to parse. The other two look like dynamic sites, so you'd need to use Selenium or start a session. Also, there's no need to combine `requests`, `beautifulsoup` and `read_html`: pandas' `read_html` can use beautifulsoup under the hood – chitown88, Nov 16 '19 at 16:06
It also looks like there is an API/XHR endpoint that you can get this data from: `https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/318/FUT?tradeDate=11/15/2019&strategy=DEFAULT&pageSize=500&_=1573920407835` for example – chitown88, Nov 16 '19 at 16:08
Where did you find the link?! :o I've spent hours searching for that. I've tried to use this link instead of the other one, but no .csv file is created. I guess it's because there is no table at this link. OK for the dynamic sites, so that's the reason why it doesn't work. Could you help me use Selenium to solve my problem? :) – manny-, Nov 16 '19 at 16:15
You can find it by looking under the Network -> XHR tab in Dev Tools (Ctrl+Shift+I, or Cmd+Option+I on macOS). You might need to reload the page, but then you can see the requests made to get the different data – chitown88, Nov 16 '19 at 16:19
Wow, I've learnt something with the XHR tab! Thanks. Do you think that, apart from the tradeDate part, other parts of the link will change every day? – manny-, Nov 16 '19 at 16:26
The root URL won't change. But yeah, you may need to change the parameters in the `payload` to fit what you want. So I'm guessing yes, `'tradeDate'` will need to change to get the data for the current date – chitown88, Nov 16 '19 at 16:37
I've added "import time" + "date = time.strftime("%m/%d/%Y")" to get the required format. Then I wrote "'tradeDate': {date}," in the payload, but it doesn't work. I guess that's not the right way to insert a variable into the payload? Any ideas? :) – manny-, Nov 16 '19 at 17:05
I added str(time.strftime('%Y%m%d')) to the URL, and that seemed to work. But now I want to add a condition: if today is between Monday and Friday, then ... (if datetime.datetime.now().isoweekday() in range (1, 7):), but I'm running into some issues. I'll create another question! You've already answered the main question here :) – manny-, Nov 17 '19 at 13:10
https://stackoverflow.com/questions/58901554/syntax-to-use-to-make-minus-x-days-with-python-3-8 – manny-, Nov 17 '19 at 14:36
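
A quick way to check the first point above is to count the `<table>` tags in the HTML that `requests` actually receives. The sketch below uses only the standard library; the `TableCounter` class, the `count_tables` helper and the two sample strings are hypothetical, made up for illustration. When the count is zero, `pd.read_html` raises a `ValueError` instead of returning a DataFrame, which would explain why the script stops before writing a fourth file.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "table":
            self.tables += 1


def count_tables(html):
    """Return the number of <table> tags found in the given HTML string."""
    counter = TableCounter()
    counter.feed(html)
    counter.close()
    return counter.tables


# Hypothetical samples: a static page with a table, and a page whose
# content is filled in later by JavaScript.
static_page = "<html><body><table><tr><td>1</td></tr></table></body></html>"
dynamic_page = "<html><body><div id='app'></div><script></script></body></html>"

print(count_tables(static_page))   # 1
print(count_tables(dynamic_page))  # 0
```

In the real script, `req.text` could be passed to `count_tables` before calling `pd.read_html`, to decide whether the page is worth parsing at all.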

chitown88 · Accepted Answer · 2019-11-18T23:04:02.437

Try this to get the data from those other sites. The last site (Euronext) is a little trickier, so you'd need to try Selenium for it:

import pandas as pd
import requests
from datetime import date, timedelta

# The Eurex page contains a plain HTML table, so read_html can fetch and parse it directly.
url_DAX = 'https://www.eurexchange.com/exchange-en/market-data/statistics/market-statistics-online/100!onlineStats?viewType=4&productGroupId=13394&productId=34642&cp=&month=&year=&busDate=20191114'
df = pd.read_html(url_DAX)[0]
df.to_csv('results_DAX.csv')
print(df)

# The CME endpoint expects the trade date as MM/DD/YYYY; here, two days before today.
dt = date.today() - timedelta(days=2)
dateParam = dt.strftime('%m/%d/%Y')

# The CME pages load their data from a JSON endpoint, so query that endpoint instead of the HTML page.
url_DOW = 'https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/318/FUT'
payload = {
    'tradeDate': dateParam,
    'strategy': 'DEFAULT',
    'pageSize': '500',
    '_': '1573920502874'}
response = requests.get(url_DOW, params=payload).json()
df = pd.DataFrame(response['settlements'])
df.to_csv('results_DOW.csv')
print(df)

# Same endpoint for the NASDAQ future; only the product id in the URL differs.
url_NASDAQ = 'https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/146/FUT'
payload = {
    'tradeDate': dateParam,
    'strategy': 'DEFAULT',
    'pageSize': '500',
    '_': '1573920650587'}
response = requests.get(url_NASDAQ, params=payload).json()
df = pd.DataFrame(response['settlements'])
df.to_csv('results_NASDAQ.csv')
print(df)
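
The fixed `timedelta(days=2)` can land on a weekend, when there is presumably no settlement data. Here is a minimal sketch of one way to pick the trade date instead. It assumes settlements are published on weekdays only (exchange holidays are ignored), and that the `_` parameter can be dropped, which the comments suggest but do not confirm. `previous_weekday` and `build_payload` are hypothetical helper names:

```python
from datetime import date, timedelta


def previous_weekday(day):
    """Return the most recent Monday-to-Friday date strictly before `day`."""
    candidate = day - timedelta(days=1)
    # isoweekday() is 1 for Monday through 7 for Sunday.
    while candidate.isoweekday() > 5:
        candidate -= timedelta(days=1)
    return candidate


def build_payload(trade_date):
    """Build the query parameters used for the CME settlements request."""
    return {
        'tradeDate': trade_date.strftime('%m/%d/%Y'),
        'strategy': 'DEFAULT',
        'pageSize': '500',
    }


# Monday 18 November 2019 rolls back to Friday 15 November 2019.
payload = build_payload(previous_weekday(date(2019, 11, 18)))
print(payload['tradeDate'])  # 11/15/2019
```

The resulting dictionary can be passed as `params=payload` to `requests.get`, in the same way as the hand-written payloads.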

edited Nov 18 '19 at 23:04

answered Nov 16 '19 at 16:18

chitown88


Working perfectly! Any idea what the "'_': '1573920650587'" part refers to? – manny- Nov 16 '19 at 16:37
Not quite sure, to be honest. It's also possible it's not needed. Try removing it and see what happens... – chitown88 Nov 16 '19 at 16:38
I've refreshed the page, and in the XHR tab I get a different number than you, so maybe it's just a random number, hmm – manny- Nov 16 '19 at 16:40
Yeah, it might just be something generated. Like I said, it might work without it anyway – chitown88 Nov 16 '19 at 16:40
Super! Thanks again – manny- Nov 19 '19 at 18:21
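
On the `_` value: read as a Unix timestamp in milliseconds, it decodes to a moment on 16 November 2019, close to the time the answer was posted. That would be consistent with a cache-busting parameter generated when the page was loaded, though nothing in the thread confirms it. A small sketch of the decoding (the `decode_millis` name is hypothetical):

```python
from datetime import datetime, timedelta, timezone


def decode_millis(value):
    """Interpret a string of digits as milliseconds since the Unix epoch (UTC)."""
    millis = int(value)
    # Split into whole seconds and leftover milliseconds to avoid float rounding.
    seconds, remainder = divmod(millis, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=remainder)


print(decode_millis('1573920650587').isoformat())
# 2019-11-16T16:10:50.587000+00:00
```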
