
The web API "vip.stock.finance.sina.com.cn/q/go.php/vFinanceAnalyze/kind/profit/index.phtml" provides a query called via the GET method, paginated at 40 data rows per page. I wrote a function to call the web API and print the number of rows in the returned table:

def get_rows(page):
    import urllib.request
    import lxml.html
    url = "http://vip.stock.finance.sina.com.cn/q/go.php/vFinanceAnalyze/kind/profit/"\
          "index.phtml?s_i=&s_a=&s_c=&reportdate=2021&quarter=4&p={}".format(page)
    table_xpath = '//*[@id="dataTable"]'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    data_string = urllib.request.urlopen(req).read()  # raw bytes from the response
    root = lxml.html.fromstring(data_string)          # parse the HTML
    dtable = root.xpath(table_xpath)[0]               # the data table
    rows = dtable.xpath('.//tr')                      # all rows, header included
    print(len(rows))

Now call it:

>>> get_rows(page=1)
41
>>> get_rows(page=2)
41
>>> get_rows(page=3)
26
>>> get_rows(page=4)
41

Why does my function get only part of the rows (26) for page 3, when the webpage surely contains 40 data rows (41 = 1 header + 40 data rows)? I have found many pages that run into the same issue: the webpage contains 40 rows of data, yet get_rows() prints a smaller number. Please try it with my function:

[get_rows(page) for page in [3,38,73,81,118,123]]

2 Answers


The issue appears to be that the HTML meta tag corresponding to Content-Type identifies the character set as GB 2312, like so:

<meta http-equiv="Content-type" content="text/html; charset=GB2312" />

whereas the Content-Type header returned as part of the response identifies the character set as GBK, like so:

Content-Type: text/html; charset=gbk

As GBK is a superset of GB 2312, much of the content in the pages will be encoded identically, and so can be decoded using either character set. For the third page, however, the name of the stock corresponding to code 688279 (峰岹科技) cannot be encoded using GB 2312, and so attempting to decode it using GB 2312 will fail. The exact symptom of this failure is odd in that parsing will halt at this point (hence the short number of matched elements), but the document returned (root) can still be worked with. This is very likely the same for the other pages you've discovered with the same problem.
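This is easy to check in isolation; a quick illustrative sketch, using the stock name from page 3 (the character that trips GB 2312 is presumably 岹):

name = "峰岹科技"  # the stock name for code 688279, from page 3

print(name.encode("gbk"))  # succeeds: GBK covers the full basic CJK ideograph range

try:
    name.encode("gb2312")
except UnicodeEncodeError as exc:
    print(exc)  # fails: the character is outside GB 2312's ~6,763 hanzi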

More concretely, in your code you are calling:

urllib.request.urlopen(req).read()

and then operating on this sequence of bytes alone. So, when passing this to lxml for parsing:

lxml.html.fromstring(data_string)

it has only the meta tag to consult for determining the encoding.
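A minimal sketch of the distinction, with made-up markup:

import lxml.html

raw = ('<html><head><meta charset="gbk"/></head>'
       '<body><p>data</p></body></html>').encode('gbk')

# Given bytes, lxml decodes them itself, guided by the document's <meta>
# charset; given an already-decoded str, no charset sniffing takes place.
root_from_bytes = lxml.html.fromstring(raw)
root_from_text = lxml.html.fromstring(raw.decode('gbk'))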

The best path forward appears to be the one outlined here, which explicitly decodes the read bytes with the character set encoding declared in the Content-Type header. So, in this particular case, this involves changing:

data_string=urllib.request.urlopen(req).read()

to something like:

resource = urllib.request.urlopen(req)
data_string = resource.read().decode(resource.headers.get_content_charset())
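One caveat: get_content_charset() returns None when the server sends no charset parameter, and decode(None) raises a TypeError. A slightly more defensive sketch, with a hypothetical helper name and GBK as an assumed fallback:

import urllib.request

def fetch_decoded(req):
    resource = urllib.request.urlopen(req)
    # Fall back to GBK (the charset this server currently sends) if the
    # Content-Type header ever arrives without one -- an assumption to verify.
    charset = resource.headers.get_content_charset() or 'gbk'
    return resource.read().decode(charset)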
msbit

The encoding declared in the target webpage is gb2312, but if you use it, an invalid-encoding error occurs. I tried many times; at last, setting gbk works fine!

def get_rows(page):
    import urllib.request
    import lxml.html
    url = "http://vip.stock.finance.sina.com.cn/q/go.php/vFinanceAnalyze/kind/profit/"\
          "index.phtml?s_i=&s_a=&s_c=&reportdate=2021&quarter=4&p={}".format(page)
    table_xpath = '//*[@id="dataTable"]'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    data_string = urllib.request.urlopen(req).read().decode('gbk')  # decode explicitly, before lxml sees the bytes
    root = lxml.html.fromstring(data_string)
    dtable = root.xpath(table_xpath)[0]
    rows = dtable.xpath('.//tr')
    print(len(rows))
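If GBK itself ever trips on a character, GB 18030 is a superset of GBK that covers all of Unicode, so this one-line variant (an assumption, not tested against every page here) is an even safer decode:

data_string = urllib.request.urlopen(req).read().decode('gb18030')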
newview