
When I fetch pages with the requests library and parse them with Beautiful Soup, I get blocked as a "bot".

import requests
from bs4 import BeautifulSoup

reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/")
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)

Then I get messages from Reddit saying that they suspect me of being a bot.

What are possible solutions that still work with Beautiful Soup? (I have tried Scrapy so I could use its Crawlera, but due to my lack of Python knowledge, I cannot use it.) I don't mind a paid service, as long as it is "intuitive" enough for a beginner to use.

CottonCandy

2 Answers


There can be various reasons for being blocked as a bot.

As you are using the requests library "as is", the most probable reason for the block is a missing User-Agent header.

A first line of defense against bots and scraping is to check whether the User-Agent header comes from one of the major browsers and to block all non-browser user agents.
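For reference, you can print the User-Agent that requests sends by default; it identifies your script rather than a browser:

import requests

# The default User-Agent looks like "python-requests/2.x.x",
# which is easy for a site to flag as a script.
print(requests.utils.default_headers()["User-Agent"])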

Short version: try this:

import requests
from bs4 import BeautifulSoup

headers = requests.utils.default_headers()
headers.update({
    # Replace the default "python-requests/..." value with a
    # browser-like User-Agent string.
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/", headers=headers)
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)
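As a quick sanity check, you can also look at the status code before parsing:

print(reddit1Link.status_code)  # 200 means the request went through; 429 usually means you are being rate-limited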

Detailed explanation: Sending "User-agent" using Requests library in Python

Done Data Solutions

I used to use Mechanize for stuff like this. It has been a couple of years, but it should still work.

Try something like this:

from mechanize import Browser
from bs4 import BeautifulSoup

b = Browser()
# Don't honor robots.txt, and present browser-like
# Referer and User-agent headers.
b.set_handle_robots(False)
b.addheaders = [('Referer', 'https://www.reddit.com'), ('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

b.open('https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/')
# Hand the raw response body to Beautiful Soup as usual.
soup = BeautifulSoup(b.response().read(), "html.parser")
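From there, soup is an ordinary BeautifulSoup object; for example, printing the page title is an easy way to confirm the fetch worked:

print(soup.title.text)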

EDIT:

I just realized that, sadly, Mechanize is only available for Python 2.5-2.7. There are, however, other options available. See Installing mechanize for python 3.4

Raudbjorn
    Thanks Raudbjorn! This doesn't work for python 3, does it? – CottonCandy Apr 16 '17 at 18:36
  • No, sorry, see my edited answer. But other libraries do allow you to add headers and turn off robot handling; just remember that it can be a good idea to throttle a little bit, or you might "get caught", use time.sleep(1) after each request to avoid being identified as a bot. – Raudbjorn Apr 16 '17 at 18:44
  • It already support python3 since version 0.4.0, so I deleted the reference from the answer. – szedjani Aug 17 '19 at 14:03
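For completeness, here is a minimal sketch of the throttling suggested above; the urls list is a hypothetical placeholder for whatever threads you want to fetch:

import time
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
urls = []  # hypothetical: fill in the Reddit thread URLs you want to fetch

for url in urls:
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "lxml")
    # ... process soup here ...
    time.sleep(1)  # pause after each request so the traffic looks less bot-like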