As the title says, I am trying to access a URL through several different proxies sequentially (using a for loop). Right now, this is my code:

import requests
import json
with open('proxies.txt') as proxies:
    for line in proxies:
        proxy=json.loads(line)
        with open('urls.txt') as urls:
            for line in urls:
                url=line.rstrip()
                data=requests.get(url, proxies={'http':line})
                data1=data.text
                print data1

and my urls.txt file:

http://api.exip.org/?call=ip

and my proxies.txt file:

{"https": "84.22.41.1:3128"}
{"http":"194.126.181.47:81"}
{"http":"218.108.170.170:82"}

I got these proxies from www.hidemyass.com.

For some reason, the output is

68.6.34.253
68.6.34.253
68.6.34.253

as if it were accessing that website through my own router's IP address. In other words, it is not going through the proxies I give it; it just loops and uses my own address over and over again. What am I doing wrong?

Ben Sidhom
BigBoy1337
  • As I suggested on one of your previous questions, you would find it a lot easier to understand what's happening if you print out some of the intermediate values you're passing around, or run in a debugger or interactive visualizer or some other way of seeing them. If you printed out each `{'http': line}`, it would be pretty obvious what was going wrong. – abarnert Aug 22 '13 at 00:47
  • Why would I print out each {'http':line}? Wouldn't that just print the url a bunch of times? Shouldn't I be printing out the html on the webpage so that I can verify that it is the proxy server ip address? – BigBoy1337 Aug 22 '13 at 02:38
  • If you don't know what it would print out, you will learn what's happening. If you think you know what it would print out, you will learn whether you're right. This is the most basic debugging there is. Clearly something in your script is not doing what you expected. The first step is to figure out at which point things are going wrong, and the only way to do that is to look at the values and see whether they're wrong. – abarnert Aug 22 '13 at 02:40

3 Answers

According to this thread, you need to specify the proxies dictionary as {"protocol" : "ip:port"}, so your proxies file should look like

{"https": "84.22.41.1:3128"}
{"http": "194.126.181.47:81"}
{"http": "218.108.170.170:82"}

EDIT: You're reusing line for both URLs and proxies. Reusing line in the inner loop is fine in itself, but you should be passing proxies=proxy: you've already parsed the JSON and don't need to build another dictionary. Also, as abarnert says, you should check that the protocol you're requesting matches that of the proxy. The proxies are specified as a dictionary precisely so that the proxy matching the URL's protocol can be looked up.
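As a rough illustration of that lookup, here is a simplified model written for Python 3. The helper name `proxy_for` is made up for this sketch, and this is not how Requests implements proxy selection internally; it only shows why a proxy registered under "https" is never consulted for an "http" URL.

```python
from urllib.parse import urlparse


def proxy_for(url, proxies):
    """Return the proxy address registered for the URL's scheme, or None."""
    # The dictionary key is the protocol of the URL being requested.
    return proxies.get(urlparse(url).scheme)


url = "http://api.exip.org/?call=ip"
print(proxy_for(url, {"https": "84.22.41.1:3128"}))
print(proxy_for(url, {"http": "194.126.181.47:81"}))
```

With the files from the question, the first proxy entry has no "http" key, so nothing is found for the single http URL in urls.txt.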

Ben Sidhom

There are two obvious problems right here:

data=requests.get(url, proxies={'http':line})

First, because you have a for line in urls: inside the for line in proxies:, line is going to be the current URL here, not the current proxy. And besides, even if you weren't reusing line, it would be the JSON string representation, not the dict you decoded from JSON.

Then, if you fix that by substituting proxy for line, instead of something like {'https': '84.22.41.1:3128'}, you're passing {'http': {'https': '84.22.41.1:3128'}}. And that obviously isn't a valid value.
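To make the first problem concrete, the sketch below replays the loop structure from the question without making any network calls and records what would have been passed as proxies. It is written for Python 3, the function name `buggy_proxies_args` is invented for illustration, and in-memory lists stand in for the two files.

```python
import json


def buggy_proxies_args(proxy_lines, url_lines):
    """Collect the `proxies` argument the original loop would have passed."""
    seen = []
    for line in proxy_lines:
        proxy = json.loads(line)  # parsed, but never used by the buggy call
        for line in url_lines:    # rebinds `line` to the current URL
            url = line.rstrip()
            seen.append({'http': line})
    return seen


print(buggy_proxies_args(['{"https": "84.22.41.1:3128"}\n'],
                         ['http://api.exip.org/?call=ip\n']))
# [{'http': 'http://api.exip.org/?call=ip\n'}]
```

The value under 'http' is the raw URL line, trailing newline included, rather than any proxy address.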

To fix both of those problems, just do this:

data=requests.get(url, proxies=proxy)

Meanwhile, what happens when you have an HTTPS URL, but the current proxy is an HTTP proxy? The proxy won't be used for that request. So you probably want to skip such mismatched combinations, like this:

if urlparse.urlparse(url).scheme not in proxy:
    continue
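Putting the pieces together, a minimal sketch of the corrected loop might look like the following. It is written for Python 3 (so `urlparse` comes from `urllib.parse` and `print` is a function), and the helper names `plan_requests` and `fetch_all` are made up for illustration; neither appears in the original code. The `requests` import is deferred into `fetch_all` only so that the pairing logic can be tried without the library installed or a network connection.

```python
import json
from urllib.parse import urlparse


def plan_requests(proxy_lines, url_lines):
    """Yield (url, proxy_dict) pairs, skipping scheme mismatches."""
    urls = [line.strip() for line in url_lines if line.strip()]
    for proxy_line in proxy_lines:
        if not proxy_line.strip():
            continue  # tolerate blank lines in the proxies file
        proxy = json.loads(proxy_line)  # e.g. {"http": "194.126.181.47:81"}
        for url in urls:
            if urlparse(url).scheme not in proxy:
                continue  # this proxy does not cover the URL's protocol
            yield url, proxy


def fetch_all(proxy_path="proxies.txt", url_path="urls.txt"):
    """Request every URL through every matching proxy and print the body."""
    import requests  # imported here so plan_requests works without it

    with open(proxy_path) as proxies, open(url_path) as urls:
        for url, proxy in plan_requests(proxies, urls):
            # Pass the parsed dict itself, not a new dict built from `line`.
            print(requests.get(url, proxies=proxy, timeout=10).text)
```

Using distinct loop variable names (`proxy_line`, `url`) avoids the shadowing of `line`, and reading the URLs once up front avoids reopening urls.txt for every proxy. The 10-second timeout is an arbitrary choice for this sketch.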
abarnert
  • I updated the code to match your answer. I'm not sure if I added the if statement in the right place? – BigBoy1337 Aug 22 '13 at 02:31
  • @BigBoy1337: How would I know whether you added the if statement in the right place in code that I can't even see? – abarnert Aug 22 '13 at 02:38
  • Sorry, I was about to update the question's code but realized that the question wouldn't make sense that way, because it is giving the right output now. I am just confused about where the if statement you gave should be inserted. – BigBoy1337 Aug 22 '13 at 02:42
  • @BigBoy1337: Right after the `url=line.rstrip()` seems like a good place. It has to be after we have the URL to check, and it has to be before the `get` that we want to skip, so there's not many options. – abarnert Aug 22 '13 at 02:45
  • @BigBoy1337: If you want to use the `urlparse` module, you need to `import urlparse`. – abarnert Aug 22 '13 at 21:26
  • An important follow-on: you should be explicitly using schemes in Requests' proxy dictionaries. `{'https': '83.22.41.1:3128'}` should become `{'https': 'https://83.22.41.1:3128'}`. We got so many bugs with the implicit schemes that in Requests 2.0 not placing a scheme will throw an exception. – Lukasa Aug 23 '13 at 07:44
  • So why does it not raise any exception or error when he obviously passed a wrong value as the proxy? – Milano Jul 31 '15 at 19:45
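Following up on Lukasa's comment about explicit schemes, a small helper along these lines could normalize the entries read from proxies.txt. The name `with_explicit_scheme` and its `scheme` parameter are hypothetical. By default the sketch reuses the dictionary key as the scheme, as in the comment's example; many proxies are themselves reached over plain HTTP even for https traffic, in which case passing `scheme="http"` may be what you need.

```python
def with_explicit_scheme(proxy, scheme=None):
    """Return a copy of a proxies dict whose addresses all carry a scheme."""
    result = {}
    for protocol, address in proxy.items():
        if "://" in address:
            result[protocol] = address  # already explicit, keep as-is
        else:
            # Fall back to the dict key when no scheme is given.
            result[protocol] = "{}://{}".format(scheme or protocol, address)
    return result


print(with_explicit_scheme({"https": "84.22.41.1:3128"}))
print(with_explicit_scheme({"https": "84.22.41.1:3128"}, scheme="http"))
```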

Directly copied from another answer of mine.

Well, actually you can, I've done this with a few lines of code and it works pretty well.

import requests


class Client:

    def __init__(self):
        self._session = requests.Session()
        self.proxies = None

    def set_proxy_pool(self, proxies, auth=None, https=True):
        """Randomly choose a proxy for every GET/POST request        
        :param proxies: list of proxies, like ["ip1:port1", "ip2:port2"]
        :param auth: if proxy needs auth
        :param https: default is True, pass False if you don't need https proxy
        """
        from random import choice

        if https:
            self.proxies = [{'http': p, 'https': p} for p in proxies]
        else:
            self.proxies = [{'http': p} for p in proxies]

        def get_with_random_proxy(url, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_get(url, **kwargs)

        def post_with_random_proxy(url, *args, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_post(url, *args, **kwargs)

        self._session.original_get = self._session.get
        self._session.get = get_with_random_proxy
        self._session.original_post = self._session.post
        self._session.post = post_with_random_proxy

    def remove_proxy_pool(self):
        self.proxies = None
        self._session.get = self._session.original_get
        self._session.post = self._session.original_post
        del self._session.original_get
        del self._session.original_post

    # You can define whatever operations using self._session

I use it like this:

client = Client()
client.set_proxy_pool(['112.25.41.136', '180.97.29.57'])

It's simple, but actually works for me.
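For comparison, the same random-choice idea can be sketched without replacing the session's methods. The `ProxyPool` class below is a hypothetical alternative, not part of the answer's code or of Requests: it only builds the keyword arguments, and the caller passes them to `get` or `post`.

```python
import random


class ProxyPool:
    """Hold a list of proxy addresses and hand out one at random."""

    def __init__(self, addresses, https=True):
        protocols = ("http", "https") if https else ("http",)
        # One proxies dict per address, covering the chosen protocols.
        self._proxies = [{p: addr for p in protocols} for addr in addresses]

    def choose(self):
        """Return one proxies dict picked at random."""
        return random.choice(self._proxies)

    def request_kwargs(self, **kwargs):
        """Return kwargs for a request call with a random proxy filled in."""
        kwargs["proxies"] = self.choose()
        return kwargs


pool = ProxyPool(["112.25.41.136", "180.97.29.57"])
print(pool.request_kwargs(timeout=10))
```

A call would then look like `session.get(url, **pool.request_kwargs(timeout=10))`. Because nothing on the session is overwritten, there is nothing to restore afterwards.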

laike9m