
I have a script that raises the following error when I run it:

urllib2.HTTPError: HTTP Error 400: Bad Request

Can you help me?

from lxml import html
import urllib2
import urllib

ip_list = []
port_list = []
protocol_list = []
array = [20, 40]
ck = True
i = 0
while i < len(array) :
    h = urllib2.urlopen('http://proxylist.me/proxys/index/'+ str(array[i]))
    HTML_CODE = h.read()
    tree = html.fromstring(HTML_CODE)
    for block in tree.xpath('//tbody/tr'):
        ip, port, _, protocol, _, _, _, _, _ = [
            x.strip()
            for x in block.xpath('.//text()')
                if x.strip() not in ""
            ]
        ip_l = "{}".format(ip)
        port_l = "{}".format(port)
        protocol_l = "{}".format(protocol)
        if ip_l != {}:
            ck = True
            ip_list.append(ip_l)
            port_list.append(port_l)
            protocol_list.append(protocol_l)
            i = i+1
        else:
            ck = False
    print ip_list

I am getting this error:

Traceback (most recent call last):
  File "C:/Users/PC0308-PC/Desktop/get_data_html.py", line 11, in <module>
    h = urllib2.urlopen('http://proxylist.me/proxys/index/'+str(i))
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 437, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 400: Bad Request
Mr.Junsu
  • What is the actual endpoint you are trying to hit? `i` is a list and you're casting it to a `str`, then appending that to the URL. I don't think that's what you want to be doing. – dursk Nov 18 '15 at 02:57
  • @dursk, can you help me edit the script so that it fetches the data from the first two pages? – Mr.Junsu Nov 18 '15 at 03:32
  • Please read [How do I ask a good question?](http://stackoverflow.com/help/how-to-ask) before attempting to ask more questions. –  May 05 '17 at 04:25
  • [What does your step debugger tell you?](http://stackoverflow.com/questions/25385173/what-is-a-debugger-and-how-can-it-help-me-diagnose-problems) –  May 05 '17 at 04:25

1 Answer

from lxml import html
import urllib2

ip_list = []
port_list = []
protocol_list = []
array = [0, 20, 40]  # page offsets appended to the URL

for item in array:
    # Build the URL from one offset at a time, not from the whole list
    h = urllib2.urlopen('http://proxylist.me/proxys/index/%s' % item)
    HTML_CODE = h.read()
    tree = html.fromstring(HTML_CODE)
    for block in tree.xpath('//tbody/tr'):
        # Keep only the non-empty text nodes of the row; this unpacking
        # expects exactly nine of them per row
        ip, port, _, protocol, _, _, _, _, _ = [
            x.strip()
            for x in block.xpath('.//text()')
            if x.strip()
        ]
        ip_list.append(ip)
        port_list.append(port)
        protocol_list.append(protocol)
    # Print the accumulated list after each page
    print ip_list

This works on my Windows machine and parses the first three pages of http://proxylist.me/proxys/index.
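
The 400 in the traceback is consistent with what dursk's comment describes: the URL was built from the whole list instead of from one offset. Here is a minimal sketch of the difference, with no network access. The names `BASE_URL` and `page_urls` are hypothetical and not part of the original script, and the server's exact reason for rejecting the request is an assumption.

```python
BASE_URL = 'http://proxylist.me/proxys/index/'


def page_urls(offsets):
    """Return one URL per page offset (hypothetical helper)."""
    return [BASE_URL + str(offset) for offset in offsets]


offsets = [0, 20, 40]

# Casting the whole list to str yields a single URL that ends in the
# list's repr, including brackets, commas and spaces.
bad_url = BASE_URL + str(offsets)

# Casting each offset separately yields one clean URL per page.
good_urls = page_urls(offsets)
```

Here `bad_url` ends in `[0, 20, 40]`, which is probably what the server refuses, while `good_urls` holds the three addresses the loop above requests.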

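BTW, the code as posted in your question already runs, but it only parses the first page: `i = i+1` sits inside the inner `for` loop, so `i` is incremented once per table row and is already past `len(array)` when the first page is done.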
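
To see why only one page is fetched, here is a small simulation of the two loop shapes, again with no network access. It is only a sketch: `pages_fetched_while`, `pages_fetched_for` and `rows_per_page` are hypothetical names, and the row count stands in for however many `<tr>` elements the real page contains (assumed to be at least two).

```python
def pages_fetched_while(offsets, rows_per_page):
    """Mimic the question's loop: the counter advances once per row."""
    fetched = []
    i = 0
    while i < len(offsets):
        fetched.append(offsets[i])      # stands in for urlopen(...)
        for _ in range(rows_per_page):  # stands in for the <tr> rows
            i = i + 1                   # counter moves inside the row loop
    return fetched


def pages_fetched_for(offsets, rows_per_page):
    """Mimic the answer's loop: one iteration per offset."""
    fetched = []
    for offset in offsets:
        fetched.append(offset)          # stands in for urlopen(...)
        for _ in range(rows_per_page):
            pass                        # row handling does not touch the loop
    return fetched


first_only = pages_fetched_while([20, 40], 3)
all_pages = pages_fetched_for([20, 40], 3)
```

With three rows per page, `first_only` is `[20]` and `all_pages` is `[20, 40]`, which matches the behaviour described above.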

minhhn2910