I have a list of user ids and I'm interested in crawling their reputation.

I wrote a script using BeautifulSoup that crawls each user's reputation. The problem is that I get a Too Many Requests error after my script has run for less than a minute, and after that I am unable to open Stack Overflow manually in a browser either.

My question is: how do I crawl the reputations without getting a Too Many Requests error?

My code is given below:

from requests import get
from bs4 import BeautifulSoup

for id in df['target']:  # df['target'] holds the list of user ids
    url = 'https://stackoverflow.com/users/' + str(id)
    print(url)
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    site_title = html_soup.find("title").contents[0]
    if "Page Not Found - Stack Overflow" in site_title:
        reputation = "NA"
    else:
        reputation = (html_soup.find(class_='grid--cell fs-title fc-dark')).contents[0].replace(',', "")
        print(reputation)

2 Answers

I suggest using the Python time module and putting a time.sleep(5) in your for loop. The error comes from making too many requests in too short a time period. You may have to experiment with the actual sleep time to get it right, though.
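As a minimal sketch of that approach (the five-second delay is a starting point to tune, and the parsing step is elided; neither is part of the original answer):

import time
from requests import get

for id in df['target']:  # df['target'] holds the user ids, as in the question
    response = get('https://stackoverflow.com/users/' + str(id))
    # ... parse the reputation from response.text as in the question ...
    time.sleep(5)  # pause between requests; adjust until the 429 errors stop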

– ahota

You can check whether response.status_code == 429, see if there is a value in the response telling you how long to wait, and then wait that number of seconds.
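A minimal sketch of that check, assuming the standard Retry-After HTTP header (which, as the captured headers below show, Stack Overflow returned as "0" here, so the 450-second fallback is a guess):

import time
import requests

response = requests.get(url)  # url built as in the question
while response.status_code == 429:
    # Retry-After is the standard header for this; fall back to a long pause if it is missing or zero
    wait = int(response.headers.get('Retry-After', 0)) or 450
    time.sleep(wait)
    response = requests.get(url)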

I reproduced the issue here and couldn't find any information about how long to wait in either the content or the headers.

I suggest putting in some throttles and adjusting them until you're happy with the results.

See https://data.stackexchange.com/stackoverflow/query/952/top-500-answerers-on-the-site for an example of getting user reputations from the Stack Exchange Data Explorer.
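If you export that query's results as CSV, something like this could populate df['target'] (the file name and the id column name are assumptions about the export, not verified):

import pandas as pd

# hypothetical export file and column name from the Data Explorer query
users = pd.read_csv('QueryResults.csv')
df = {'target': users['id'].tolist()}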

Example follows.

#!/usr/bin/env python

import time
import requests
from bs4 import BeautifulSoup

df = {}
df['target'] = [ ... ]  # see https://data.stackexchange.com/stackoverflow/query/952/top-500-answerers-on-the-site

throttle = 2  # seconds to sleep between requests
whoa = 450    # seconds to back off after a 429 response

with open('results.txt', 'w') as file_handler:
    file_handler.write('url\treputation\n')
    for id in df['target']:
        time.sleep(throttle)
        url = 'https://stackoverflow.com/users/' + str(id)
        print(url)
        response = requests.get(url)
        # if we got rate limited, log the response and back off before retrying
        while response.status_code == 429:
            print(response.content)
            print(response.headers)
            time.sleep(whoa)
            response = requests.get(url)
        html_soup = BeautifulSoup(response.text, 'html.parser')
        site_title = html_soup.find("title").contents[0]
        if "Page Not Found - Stack Overflow" in site_title:
            reputation = "NA"
        else:
            reputation = (html_soup.find(class_='grid--cell fs-title fc-dark')).contents[0].replace(',', "")
        print('reputation: %s' % reputation)
        file_handler.write('%s\t%s\n' % (url, reputation))

Example error content.

<!DOCTYPE html>
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Too Many Requests - Stack Exchange</title>
    <style type="text/css">
        body
        {
            color: #333;
            font-family: 'Helvetica Neue', Arial, sans-serif;
            font-size: 14px;
            background: #fff url('img/bg-noise.png') repeat left top;
            line-height: 1.4;
        }
        h1
        {
            font-size: 170%;
            line-height: 34px;
            font-weight: normal;
        }
        a { color: #366fb3; }
        a:visited { color: #12457c; }
        .wrapper {
            width:960px;
            margin: 100px auto;
            text-align:left;
        }
        .msg {
            float: left;
            width: 700px;
            padding-top: 18px;
            margin-left: 18px;
        }
    </style>
</head>
<body>
    <div class="wrapper">
        <div style="float: left;">
            <img src="https://cdn.sstatic.net/stackexchange/img/apple-touch-icon.png" alt="Stack Exchange" />
        </div>
        <div class="msg">
            <h1>Too many requests</h1>
                        <p>This IP address (nnn.nnn.nnn.nnn) has performed an unusual high number of requests and has been temporarily rate limited. If you believe this to be in error, please contact us at <a href="mailto:team@stackexchange.com?Subject=Rate%20limiting%20of%20nnn.nnn.nnn.nnn%20(Request%20ID%3A%202158483152-SYD)">team@stackexchange.com</a>.</p>
                        <p>When contacting us, please include the following information in the email:</p>
                        <p>Method: rate limit</p>
                        <p>XID: 2158483152-SYD</p>
                        <p>IP: nnn.nnn.nnn.nnn</p>
                        <p>X-Forwarded-For: nnn.nnn.nnn.nnn</p>
                        <p>User-Agent: python-requests/2.20.1</p>
                        <p>Reason: Request rate.</p>
                        <p>Time: Tue, 20 Nov 2018 21:10:55 GMT</p>
                        <p>URL: stackoverflow.com/users/nnnnnnn</p>
                        <p>Browser Location: <span id="jslocation">(not loaded)</span></p>
        </div>
    </div>
    <script>document.getElementById('jslocation').innerHTML = window.location.href;</script>
</body>
</html>

Example error headers.

{ "Content-Length": "2054", "Via": "1.1 varnish", "X-Cache": "MISS", "X-DNS-Prefetch-Control": "off", "Accept-Ranges": "bytes", "X-Timer": "S1542748255.394076,VS0,VE0", "Server": "Varnish", "Retry-After": "0", "Connection": "close", "X-Served-By": "cache-syd18924-SYD", "X-Cache-Hits": "0", "Date": "Tue, 20 Nov 2018 21:10:55 GMT", "Content-Type": "text/html" }

– Keith John Hutchison
  • The error headers don't say how long to wait, which is sad. Thanks for the help, but a wait time of 15 minutes is a lot. I was just wondering: if the script just keeps checking the response until it is no longer 429 and then carries on as normal, is a wait time still necessary? – nzy Nov 20 '18 at 21:31
  • You can adjust the throttle amounts as you wish. If the throttle is set correctly you'll never hit 'whoa'. I'm not sure how long the rate limit lasts. Stack Overflow should send back information saying to wait x number of seconds. – Keith John Hutchison Nov 20 '18 at 21:35
  • I'm doing a run with throttle = 2, whoa = 450. – Keith John Hutchison Nov 20 '18 at 21:46
  • That run processed 500 URLs with no issues. – Keith John Hutchison Nov 20 '18 at 22:13
  • I will try it and let's see what happens. Thank you so much – nzy Nov 20 '18 at 22:33
  • What were your results? – Keith John Hutchison Nov 22 '18 at 06:52
  • It's still running, but I think this will get the job done. Thanks. – nzy Nov 24 '18 at 04:35
  • Around 20 million – nzy Nov 24 '18 at 19:03
  • I recommend having a look at Puppeteer: https://github.com/GoogleChrome/puppeteer You could script Puppeteer to wait for an operator to click 'I am not a robot'. The limit on the result set from a data.stackexchange query is currently 50,000, so a script that queries in blocks of ids would stay within the limit. https://data.stackexchange.com/stackoverflow/query/932695 – Keith John Hutchison Nov 24 '18 at 21:16