11

I have made a web scraper in Python that tells me when free bet offers on various bookie websites have changed or new ones have been added.

However, the bookies tend to record information relating to IP traffic and MAC addresses in order to flag matched bettors.

How can I spoof my IP address when using the Request() method in the urllib.request module?

My code is below:

from urllib.request import Request, urlopen
import bs4

req = Request('https://www.888sport.com/online-sports-betting-promotions/', headers={'User-Agent': 'Mozilla/5.0'})
site = urlopen(req).read()
content = bs4.BeautifulSoup(site, 'html.parser')
Nafeez Quraishi
Diran
  • you can make use of proxy chains, to achieve this! – U.Swap Aug 05 '16 at 09:30
  • this site filters by country, you need a valid proxy for do it – ZiTAL Aug 05 '16 at 10:03
  • req = Request(...) followed by req.set_proxy(r, "HTTP") is throwing up an error saying set_proxy() missing 1 required positional argument: 'type' (see the note after these comments). r is a 127.x.x.x IP address (I'm using it for the purpose of testing) – Diran Aug 05 '16 at 10:06
  • Literally the first link in google for "Python proxy urllib" http://stackoverflow.com/questions/3168171/how-can-i-open-a-website-with-urllib-via-proxy-in-python – Rafael Almeida Aug 05 '16 at 11:37
  • I've had really, really good results with Crawlera! – Monica Heddneck May 23 '18 at 18:16
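
For reference on the set_proxy() error above: urllib's Request.set_proxy() takes two positional arguments, host and type, where host is the proxy's 'ip:port' and type is the URL scheme. A minimal sketch (the loopback address and port here are only placeholders):

from urllib.request import Request

req = Request('https://www.888sport.com/online-sports-betting-promotions/',
              headers={'User-Agent': 'Mozilla/5.0'})
# set_proxy(host, type): host is 'ip:port', type is the scheme, e.g. 'http'
req.set_proxy('127.0.0.1:8080', 'http')  # placeholder proxy address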

3 Answers

15

I faced the same problem a while ago. Here is the code snippet I use in order to scrape anonymously.

First, install the required packages: pip3 install fake-useragent ipython

Here is the source code:

from urllib.request import Request, urlopen
from fake_useragent import UserAgent
import random
from bs4 import BeautifulSoup
from IPython.core.display import clear_output

ua = UserAgent()  # From here we generate random user agents
proxies = []      # Will contain proxies as {'ip': ..., 'port': ...}

# Main function
def main():
  # Retrieve the latest free proxies
  proxies_req = Request('https://www.sslproxies.org/')
  proxies_req.add_header('User-Agent', ua.random)
  proxies_doc = urlopen(proxies_req).read().decode('utf8')

  soup = BeautifulSoup(proxies_doc, 'html.parser')
  proxies_table = soup.find(id='proxylisttable')

  # Save the proxies in the array
  for row in proxies_table.tbody.find_all('tr'):
    cells = row.find_all('td')
    proxies.append({
      'ip':   cells[0].string,
      'port': cells[1].string
    })

  # Choose a random proxy to start with
  proxy_index = random_proxy()
  proxy = proxies[proxy_index]

  for n in range(1, 20):
    # Every 10 requests, rotate to a new proxy
    if n % 10 == 0:
      proxy_index = random_proxy()
      proxy = proxies[proxy_index]

    req = Request('http://icanhazip.com')
    req.set_proxy(proxy['ip'] + ':' + proxy['port'], 'http')

    # Make the call
    try:
      my_ip = urlopen(req).read().decode('utf8')
      print('#' + str(n) + ': ' + my_ip)
      clear_output(wait = True)
    except Exception:  # On error, delete this proxy and find another one
      del proxies[proxy_index]
      print('Proxy ' + proxy['ip'] + ':' + proxy['port'] + ' deleted.')
      proxy_index = random_proxy()
      proxy = proxies[proxy_index]

# Retrieve a random proxy index (we need the index to delete it if not working)
def random_proxy():
  return random.randint(0, len(proxies) - 1)

if __name__ == '__main__':
  main()

That will give you some working proxies. And then this part:

user_agent_list = (
   #Chrome
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    # Internet Explorer
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
)

These will give you different User-Agent headers, pretending to be different browsers. Last but not least, just pass them into your request:

from requests import get

# Make a GET request with a random user agent and a random proxy
user_agent = random.choice(user_agent_list)
headers = {'User-Agent': user_agent, 'Accept-Language': 'en-US, en;q=0.5'}
proxy = random.choice(proxies)
response = get('your url', headers=headers,
               proxies={'http': 'http://' + proxy['ip'] + ':' + proxy['port']})
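
Since the question uses urllib.request rather than requests, the same idea works there too; a minimal sketch with set_proxy(), assuming the proxies list and user_agent_list from above are in scope:

import random
from urllib.request import Request, urlopen

# Pick a random user agent and a random proxy from the lists built above
proxy = random.choice(proxies)  # e.g. {'ip': '1.2.3.4', 'port': '8080'}
req = Request('http://icanhazip.com',
              headers={'User-Agent': random.choice(user_agent_list)})
req.set_proxy(proxy['ip'] + ':' + proxy['port'], 'http')
print(urlopen(req, timeout=10).read().decode('utf8'))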

Hope that works with your problem.

Otherwise look here: https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/

Cheers

Melroy van den Berg
Yannik Suhre
  • Hey @Yannik Suhre thanks for this. Can you clarify how user_agent_list gets generated in the first code block? I see it (user_agent_list) in the second code block (user_agent = random.choice(user_agent_list)) but don't quite get how it's created from BeautifulSoup, urllib.request, fake_useragent & all the other imports in the first code block... – JC23 May 03 '21 at 16:30
  • @JC23 the `user_agent_list` (which is a list, although I wrongly wrote it as a tuple above) is just the block copy-pasted. So if you want to use this, you have to copy this list too. Hence this list does not depend on BS, requests or any other package; basically it is just a `list` of strings. Does this clarify it for you? Otherwise please ask again. – Yannik Suhre May 03 '21 at 19:13
  • 1
    Hey @Yannik Suhre thanks for the quick reply. I guess I was just dizzy or misread your answer. – JC23 May 03 '21 at 20:01
  • @YannikSuhre I explained you also need to install both `fake-useragent` and `ipython` packages. – Melroy van den Berg Aug 10 '23 at 22:48
5

In order to overcome IP rate bans and hide your real IP, you need to use proxies. There are a lot of different services that provide proxies. Consider using them, as managing proxies yourself is a real headache and the cost would be much higher. I suggest https://botproxy.net among others. They provide rotating proxies through a single endpoint. Here is how you can make requests using this service:

#!/usr/bin/env python
import urllib.request
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler(
        {'http': 'http://user-key:key-password@x.botproxy.net:8080',
         'https': 'http://user-key:key-password@x.botproxy.net:8080'}))
print(opener.open('https://httpbin.org/ip').read())
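
If you want every subsequent urlopen() call to go through the proxy, you can also install the opener globally; a small sketch reusing the opener built above:

import urllib.request

# Install the opener globally so plain urlopen() calls use the proxy as well
urllib.request.install_opener(opener)
print(urllib.request.urlopen('https://httpbin.org/ip').read())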

or using the requests library:

import requests

res = requests.get(
    'http://httpbin.org/ip',
    proxies={
        'http': 'http://user-key:key-password@x.botproxy.net:8080',
        'https': 'http://user-key:key-password@x.botproxy.net:8080'
        },
    headers={
        'X-BOTPROXY-COUNTRY': 'US'
        })
print(res.text)

They also have proxies in different countries.

mylh
1

This might help you to browse anonymously. You can use one of the free proxy sites to get proxies and update proxy = {} below.

import requests
from bs4 import BeautifulSoup

url = ''
proxy = {"http": "http://", "https": "http://"}  # fill in the proxy addresses
session = requests.session()
response = session.get(url, headers={'User-Agent': 'Mozilla/5.0'}, proxies=proxy)
content = BeautifulSoup(response.text, 'html.parser')
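
Once the proxy placeholders are filled in, a quick way to verify that traffic really goes out through the proxy and not your own IP:

# httpbin echoes back the IP it sees; it should be the proxy's, not yours
check = session.get('https://httpbin.org/ip', proxies=proxy)
print(check.json())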
Dhamodharan