
I am desperately trying to scrape this table: https://futures.huobi.com/en-us/linear_swap/info/realtime_fee/. Unfortunately, when I run the following code:

from bs4 import BeautifulSoup
import requests

url = "https://futures.huobi.com/en-us/linear_swap/info/realtime_fee/"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
print(soup)

I don't see the table in soup. I believe this is because the data is not static and is fetched with JavaScript.

What's a general solution for scraping this kind of table?

  • In case you can utilize ``pandas``, then [`pd.read_html`](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html#pandas-read-html) would be good to start with. – sushanth May 05 '21 at 15:42
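
For reference, the `pd.read_html` suggestion looks roughly like the sketch below. It only helps when the table is already present in the static HTML, so run against the JavaScript-rendered page in the question it would most likely raise `ValueError: No tables found`:

import pandas as pd

# read_html parses every <table> in the fetched HTML into a list of DataFrames.
# It needs the table in the initial page source, so for this page it will most
# likely fail, because the table is built client-side.
tables = pd.read_html("https://futures.huobi.com/en-us/linear_swap/info/realtime_fee/")
print(tables[0].head())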

1 Answer


The data is dynamically pulled from another endpoint. You can replicate that request, then do some manipulation to generate a similar-looking output. Pandas experts will probably see obvious ways to improve this; I had to look a few things up and have included the references at the top of the code.

There are some minute differences in two of the fields between what is shown on the page and what is returned when I execute the call below.

# np.r_ https://stackoverflow.com/a/41256772 @piRSquared
# applymap https://stackoverflow.com/a/48792783 @RK1

import requests
import pandas as pd
import numpy as np

headers = {'source': 'web', 'user-agent': 'Mozilla/5.0'}
data = {"contract_code": "", "sortName": "", "sortDirection": ""}

# This is the endpoint the page itself calls to populate the table
r = requests.post('https://futures.huobi.com/linear-swap-order/x/v1/linear_swap_all_funding_rate',
                  headers=headers, json=data)
result = r.json()['data']

df = pd.DataFrame(result)

# Everything except instrument_id (col 0) and settlement_datetime (col 3) is numeric
rate_cols = np.r_[list(df.columns[1:3]) + list(df.columns[4:])]
df.loc[:, rate_cols] = df.loc[:, rate_cols].apply(pd.to_numeric, errors='coerce')

# Show the premium index and rates as percentages with 6 decimal places,
# and the min/max limit columns (cols 5-8) with 2
pct_cols = np.r_[list(df.columns[1:3]) + list(df.columns[4:5])]
df.loc[:, pct_cols] = df.loc[:, pct_cols].applymap("{:,.6%}".format)
df.iloc[:, 5:9] = df.iloc[:, 5:9].applymap("{:,.2%}".format)

# Combine each min/max pair into a single "min ~ max" display column
df['Funding Rate Limit'] = df['min_funding_limit'] + ' ~ ' + df['max_funding_limit']
df['Premium Deviation Limit'] = df['min_premium_limit'] + ' ~ ' + df['max_premium_limit']

df.drop(df.columns[5:9], axis=1, inplace=True)

# Rename to match the headers shown on the page
df.rename(columns={'instrument_id': 'Contracts',
                   'premium_index': 'Premium Index',
                   'forecast_fundrate': 'Funding Rate',
                   'settlement_datetime': 'Current-period Funding settlement time',
                   'realtime_forecast_fundrate': 'Estimated Rate'
                   }, inplace=True)
cols = ['Contracts', 'Funding Rate', 'Current-period Funding settlement time', 'Estimated Rate',
        'Premium Index', 'Funding Rate Limit', 'Premium Deviation Limit']
df = df[cols]
print(df.head())
QHarr
  • Awesome! I kept only the first part of the solution. I didn't see that the data was coming from an API. Where did you find that? – Paolo Montemurro May 06 '21 at 08:59
  • In the network tab of the browser via F12 dev tools. In that tab, press F5 to refresh the page and then watch the XHR web traffic. – QHarr May 06 '21 at 11:35
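
For pages where no convenient backing endpoint can be found in the network tab, the usual general-purpose fallback is to render the page in a real browser and parse the resulting DOM. A minimal sketch with Selenium, assuming Chrome and selenium 4+ are installed; waiting on a generic <table> tag is an assumption about the page structure:

from io import StringIO

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://futures.huobi.com/en-us/linear_swap/info/realtime_fee/")
    # Wait until the page's JavaScript has rendered at least one <table>
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.TAG_NAME, "table"))
    )
    # The rendered DOM now contains the table, so read_html can parse it
    tables = pd.read_html(StringIO(driver.page_source))
    print(tables[0].head())
finally:
    driver.quit()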