
I'm trying to scrape Reddit with Nokogiri, but a single run of this keeps telling me that I'm putting in too many requests.

require 'nokogiri'
require 'open-uri'
url = "https://www.reddit.com/r/all"
redditscrape = Nokogiri::HTML(open(url))

OpenURI::HTTPError: 429 Too Many Requests

Isn't this only one request? If it's not, how do I create sleep intervals for Nokogiri?

  • You're confusing Nokogiri's purpose in your code with that of OpenURI. OpenURI makes the connection and then passes the result to Nokogiri to read, so your question isn't about Nokogiri and shouldn't be tagged as such. It's an OpenURI question and should be edited to reflect that. – the Tin Man Aug 16 '16 at 20:24
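
To make that division of labor concrete, here is a minimal sketch of the two steps (the User-Agent string is a placeholder, and URI.open is used because Kernel#open no longer accepts URLs on current Ruby):

require 'nokogiri'
require 'open-uri'

# OpenURI does the HTTP work; this is the line that can raise OpenURI::HTTPError (e.g. 429)
html = URI.open('https://www.reddit.com/r/all', 'User-Agent' => 'example-agent')

# Nokogiri only parses whatever it is handed; it never makes a request itself
doc = Nokogiri::HTML(html)
puts doc.title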

2 Answers


Reddit has an API

You could probably query the API for the particular sub-reddit(s) you want to scrape. Attempting to scrape all of Reddit just seems like a nightmare waiting to happen considering the high volume and the nested comments.

It looks like Reddit is blocking the ability to scrape in favor of using their public API.
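
As a hedged illustration of that route, the sketch below pulls the /r/all listing through Reddit's JSON endpoint instead of scraping the HTML. The .json URL, the limit parameter, the User-Agent string, and the data/children/title shape of the response are assumptions based on Reddit's public listing format:

require 'net/http'
require 'json'
require 'uri'

# Assumed JSON listing endpoint for /r/all; the User-Agent value is a placeholder
uri = URI('https://www.reddit.com/r/all.json?limit=25')
request = Net::HTTP::Get.new(uri)
request['User-Agent'] = 'my-reddit-client/0.1'

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

# A Reddit listing nests posts under data -> children; print each post title
listing = JSON.parse(response.body)
listing['data']['children'].each do |child|
  puts child['data']['title']
end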

  • Shouldn't scraping just pull all of the HTML data off the page in question? It's only the front page, and the comments are just links. – Andrew Aug 16 '16 at 18:19
  • Are you just trying to get the `hot` topics, or do you literally want every sub-reddit's name? Without knowing what you are trying to accomplish, it is difficult to give the best answer. To answer your first question: yes, in theory Nokogiri should return the HTML for that page. – sump Aug 16 '16 at 18:21
  • It is likely Reddit is blocking the user-agent used by open-uri. It simply returns 429 as a way of saying they expect 0. – Abhishek Dujari Aug 16 '16 at 18:23
  • @Vangel I was literally about to type that, haha. It looks like YouTube got on that train as well not too long ago in order to funnel users to their API. Regardless, when a site makes an API available, it makes more sense from a developer standpoint to use that instead of scraping. – sump Aug 16 '16 at 18:25
  • Write it as an answer for me so I can mark it as resolved! Edit: sorry, I meant add the you-can't-use-Nokogiri part. – Andrew Aug 16 '16 at 18:26

The real answer is that you need to set a user-agent.

https://www.reddit.com/r/redditdev/comments/3qbll8/429_too_many_requests/

and

How to set a custom user agent in ruby

This allowed me to use open-uri and Nokogiri and avoid the error.

So, to summarize:

redditscrape = Nokogiri::HTML(open(url, 'User-Agent' => 'Nooby'))
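
Putting the pieces together, here is a sketch of what the thread suggests plus the pacing the question asked about. Note that on Ruby 2.7+ open-uri is called through URI.open rather than a bare open, and the URL list and User-Agent string are placeholders:

require 'nokogiri'
require 'open-uri'

urls = ['https://www.reddit.com/r/all', 'https://www.reddit.com/r/ruby']

urls.each do |url|
  # Identify yourself with a User-Agent header so the request isn't rejected outright
  doc = Nokogiri::HTML(URI.open(url, 'User-Agent' => 'Nooby'))
  puts doc.title

  # The pause lives in your own loop, not in Nokogiri; Nokogiri only parses what you hand it
  sleep 2
end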