The web API "vip.stock.finance.sina.com.cn/q/go.php/vFinanceAnalyze/kind/profit/index.phtml"
accepts queries via the GET method and paginates the results at 40 rows per page.
I wrote a function that calls the API and prints the number of rows in the page's data table:
def get_rows(page):
    import urllib.request
    import lxml.html
    url = "http://vip.stock.finance.sina.com.cn/q/go.php/vFinanceAnalyze/kind/profit/"\
          "index.phtml?s_i=&s_a=&s_c=&reportdate=2021&quarter=4&p={}".format(page)
    table_xpath = '//*[@id="dataTable"]'
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    data_string = urllib.request.urlopen(req).read()
    root = lxml.html.fromstring(data_string)
    dtable = root.xpath(table_xpath)[0]
    rows = dtable.xpath('.//tr')
    print(len(rows))
Calling it for the first four pages gives:
get_rows(page=1)
41
get_rows(page=2)
41
get_rows(page=3)
26
get_rows(page=4)
41
Why does my function get only some of the rows (26) for page 3, when the web page definitely contains 40 data rows (41 = 1 header row + 40 data rows)? I have found many pages with the same issue: the web page contains 40 data rows, but get_rows() prints a number lower than the expected 41. Please try it with my function:
[get_rows(page) for page in [3,38,73,81,118,123]]
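To narrow down whether the rows go missing during the download or during parsing, here is a rough sketch that counts `<tr` tags directly in the raw response bytes, without going through lxml. The helper names `count_raw_rows` and `fetch_page` are hypothetical and not part of my function above; the URL and headers are the same ones used in `get_rows`. Note that the raw count covers the whole page, not only the `dataTable` element, so it is only an upper bound for comparison.

```python
import re
import urllib.request

# Hypothetical helper: matches opening <tr> tags (any case, with or without
# attributes) in raw bytes, so the HTML parser is bypassed completely.
TR_PATTERN = re.compile(rb"<tr[\s>]", re.IGNORECASE)


def count_raw_rows(html_bytes):
    """Return how many opening <tr> tags appear in the raw page bytes."""
    return len(TR_PATTERN.findall(html_bytes))


def fetch_page(page):
    """Download the raw bytes of one result page (same request as get_rows)."""
    url = ("http://vip.stock.finance.sina.com.cn/q/go.php/vFinanceAnalyze/kind/profit/"
           "index.phtml?s_i=&s_a=&s_c=&reportdate=2021&quarter=4&p={}".format(page))
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
                             'Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return resp.read()


# Usage (needs network access):
#     print(count_raw_rows(fetch_page(3)))
```

If the raw count for page 3 is at least 41 while `get_rows(3)` prints 26, the full table was downloaded and the rows are being lost when lxml parses the bytes; if the raw count is also short, the server response itself is incomplete.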