First, check whether the host allows crawling.

curl  http://www.etnet.com.hk/robots.txt |grep warrants
Allow: /www/tc/warrants/
Allow: /www/tc/warrants/realtime/
Allow: /www/sc/warrants/
Allow: /www/sc/warrants/realtime/
Allow: /www/eng/warrants/
Allow: /www/eng/warrants/realtime/
Allow: /mobile/tc/warrants/

I want to crawl the target page by sending a POST request with urllib, but the POST request with a cookie fails with urllib.error.HTTPError: HTTP Error 503: Service Unavailable.

I checked the request headers and parameters in Firefox, then built my POST request with a cookie jar:

import urllib.parse
import urllib.request as req
import http.cookiejar as cookie

cookie_jar = cookie.CookieJar()
opener = req.build_opener(req.HTTPCookieProcessor(cookie_jar))
req.install_opener(opener)

url = "http://www.etnet.com.hk/www/sc/warrants/search_warrant.php"
params = {
    "underasset":"HSI",
    "buttonsubmit":"搜寻",
    "formaction":"submitted"
}

headers = {
    'Accept':"text/htmlpplication/xhtml+xmlpplication/xml;q=0.mage/webp,*/*;q=0.8",
    'Accept-Encoding':"gzip, deflate",
    'Accept-Language':"en-US,en;q=0.5",
    'Connection':'keep-alive',
    'Content-Length':'500',
    'Content-Type':'application/x-www-form-urlencoded',
    "Host":"www.etnet.com.hk",
    "Origin":"http://www.etnet.com.hk",
    "Referer":"http://www.etnet.com.hk/www/sc/warrants/search_warrant.php",
    "Upgrade-Insecure-Requests":"1",
    "User-Agent":"Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0"
}

query_string = urllib.parse.urlencode(params)
data = query_string.encode()
cookie_req = req.Request(url, headers=headers, data=data,method='POST')
page = req.urlopen(cookie_req).read()

Running the code above raises:

urllib.error.HTTPError: HTTP Error 503: Service Unavailable

What is the bug in my code, and how can I fix it?

@NicoNing, the remaining issue is counting how many bytes the headers contain:

>>> s="""'Accept':'text/htmlpplication/xhtml+xmlpplication/xml;q=0.mage/webp,*/*;q=0.8',\
... 'Accept-Encoding':'gzip, deflate',\
... 'Accept-Language':'en-US,en;q=0.5',\
... 'Connection':'keep-alive',\
... 'Content-Type':'application/x-www-form-urlencoded',\
... 'Content-Length':'495',\
... 'Host':'www.etnet.com.hk',\
... 'Origin':'http://www.etnet.com.hk',\
... 'Referer':'http://www.etnet.com.hk/www/sc/warrants/search_warrant.php',\
... 'Upgrade-Insecure-Requests':'1',\
... 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0'"""
>>> len(s)
495

The request still fails with the headers above. If I do want to write Content-Length in the request headers, what value should I assign to it?

showkey

2 Answers

Just remove the header 'Content-Length':'500'.

Your request body is not actually 500 bytes long, but the header claims it is. That mismatch is what makes the server respond with 503.

See the docs: HTTP > HTTP headers > Content-Length

The Content-Length entity header indicates the size of the entity-body, in bytes, sent to the recipient.

If you insist on setting the Content-Length header yourself, read the documentation above to understand what it means. The answer then follows: use the byte length of the encoded request body, not the length of the headers:

"Content-Length" : str(len(data))
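
As a quick check of what that value should be, here is a minimal sketch that reuses the params from the question. Content-Length describes only the encoded request body, so counting the characters of the header text (the 495 in the question) measures the wrong thing. Nothing is sent over the network here; the name content_length is only for illustration.

```python
import urllib.parse

params = {
    "underasset": "HSI",
    "buttonsubmit": "搜寻",
    "formaction": "submitted",
}

# urlencode() percent-encodes the non-ASCII value; encode() turns the
# resulting str into the bytes that would actually be sent as the body.
data = urllib.parse.urlencode(params).encode()

# Content-Length must equal the number of bytes in the body,
# not the size of the headers.
content_length = str(len(data))

print(data)
print(content_length)
```

The printed length is far smaller than 500, which is why the hard-coded header did not match what was sent.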


import urllib.parse
import urllib.request as req
import http.cookiejar as cookie

cookie_jar = cookie.CookieJar()
opener = req.build_opener(req.HTTPCookieProcessor(cookie_jar))
req.install_opener(opener)

url = "http://www.etnet.com.hk/www/sc/warrants/search_warrant.php"
params = {
    "underasset":"HSI",
    "buttonsubmit":"搜寻",
    "formaction":"submitted"
}

query_string = urllib.parse.urlencode(params)
data = query_string.encode()

headers = {
    'Accept':"text/htmlpplication/xhtml+xmlpplication/xml;q=0.mage/webp,*/*;q=0.8",
    'Accept-Encoding':"gzip, deflate",
    'Accept-Language':"en-US,en;q=0.5",
    'Connection':'keep-alive',
    'Content-Type':'application/x-www-form-urlencoded',
    # 'Content-Length': str(len(data)),    ### optional 
    "Host":"www.etnet.com.hk",
    "Origin":"http://www.etnet.com.hk",
    "Referer":"http://www.etnet.com.hk/www/sc/warrants/search_warrant.php",
    "Upgrade-Insecure-Requests":"1",
    "User-Agent":"Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0",
}


cookie_req = req.Request(url, headers=headers, data=data,method='POST')
resp = req.urlopen(cookie_req)
print(cookie_req.get_method(), resp.status)  # POST 200

page = resp.read()
print(page)

I suggest learning more about HTTP and taking care with every header you set.
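
To illustrate why the Content-Length header is optional: urllib fills it in during request pre-processing when data is set and no such header was given. The sketch below runs only that pre-processing step by calling the HTTP handler's http_request hook by hand, so no request is sent. Calling the hook directly is just a way to inspect the result and is an assumption of this sketch; normal code would simply call urlopen.

```python
import urllib.parse
import urllib.request

url = "http://www.etnet.com.hk/www/sc/warrants/search_warrant.php"
params = {
    "underasset": "HSI",
    "buttonsubmit": "搜寻",
    "formaction": "submitted",
}
data = urllib.parse.urlencode(params).encode()

# No Content-Length header is supplied here.
request = urllib.request.Request(url, data=data, method="POST")
opener = urllib.request.build_opener()

# Run only the HTTP handler's pre-processing hook; nothing goes over the network.
for handler in opener.handlers:
    if isinstance(handler, urllib.request.HTTPHandler):
        request = handler.http_request(request)

# urllib stores the computed header under the name "Content-length".
auto_length = request.get_header("Content-length")
print(auto_length, len(data))
```

Both printed values should agree, which is the same value that str(len(data)) would give if you set the header yourself.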

NicoNing

As explained in this answer, Python's requests module is more convenient for making HTTP requests.

You can get the output you want as follows.

import requests

url = "http://www.etnet.com.hk/www/sc/warrants/search_warrant.php"
params = {
    "underasset":"HSI",
    "buttonsubmit":"搜寻",
    "formaction":"submitted"
}

out = requests.post(url, data=params)

print(out.text)

I hope this is the answer you are looking for.

McLovin