
I am trying to collect retweet data from the Chinese microblog Sina Weibo; you can see my code below. However, I keep running into the problem of the IP request limit.

To work around this, I have to add time.sleep() calls to the code. You can see that I added the line "time.sleep(10)  # to respect the IP request limit", so Python sleeps for 10 seconds after crawling each page of retweets (one page contains 200 retweets).

However, this is still not enough to stay under the IP limit.

So I am now planning, more systematically, to make Python sleep 60 seconds after every 20 pages it has crawled. Your ideas will be appreciated. (A rough sketch of what I have in mind follows the code below.)

    import csv
    import time

    # 'api' is the authenticated Weibo API client created earlier (not shown)
    ids = [3388154704688495, 3388154704688494, 3388154704688492]

    addressForSavingData = "C:/Python27/weibo/Weibo_repost/repostOwsSave1.csv"
    f = open(addressForSavingData, 'wb')  # save to a csv file
    w = csv.writer(f, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)

    for id in ids:
        if api.rate_limit_status().remaining_hits >= 205:
            for count_obj in api.counts(ids=id):
                repost_count = count_obj.__getattribute__('rt')
                print id, repost_count
                pages = repost_count / 200 + 2  # +2 because range(1, pages) stops before 'pages', so the last partial page is included
                for page in range(1, pages):
                    time.sleep(10)  # to respect the IP request limit
                    for repost in api.repost_timeline(id=id, count=200, page=page):  # get the repost timeline of a weibo
                        """1.1 reposts"""
                        mid = repost.__getattribute__("id")
                        text = repost.__getattribute__("text").encode('gb18030')  # encode before writing to the csv
                        """1.2 reposts.user"""
                        user = repost.__getattribute__("user")
                        user_id = user.id
                        """2.1 retweeted_status"""
                        rts = repost.__getattribute__("retweeted_status")
                        rts_mid = rts.id  # the id of the original weibo
                        """2.2 retweeted_status.user"""
                        rtsuser_id = rts.user[u'id']
                        try:
                            w.writerow((mid,
                                        user_id, rts_mid,
                                        rtsuser_id, text))  # write it out
                        except UnicodeEncodeError:
                            pass
        else:  # fewer than 205 hits remaining: wait until the rate limit resets
            sleep_time = api.rate_limit_status().reset_time_in_seconds  # time until the limit resets
            print sleep_time, api.rate_limit_status().reset_time
            time.sleep(sleep_time + 2)
    f.close()
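
Roughly, what I have in mind is to keep a running page counter and pause for longer after every 20 pages; a sketch of the inner loop (the counter name and the short pause are just placeholders):

    pages_crawled = 0  # counts pages across all ids

    for page in range(1, pages):
        pages_crawled += 1
        if pages_crawled % 20 == 0:
            time.sleep(60)  # long pause after every 20 pages
        else:
            time.sleep(10)  # short pause after every other page
        # ... crawl this page of reposts as above ...
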
– Frank Wang
  • i = 3; for page in range(1, 300): i += 1; if (i % 25 == 0): print i, "find i which could be exactly divided by 25" – Frank Wang May 20 '12 at 09:05

2 Answers


Can you not just pace the script instead?

I suggest making your script sleep between each request instead of firing the requests all at once, so that they span, say, a minute. This way you will also avoid any flooding bans, and it is considered good behaviour.

Pacing your requests may even let you finish faster, since the server will not lock you out for sending too many requests.
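
For example, a minimal pacing loop might look like this (a sketch, not your actual code; the interval is a guess to be tuned against the API's real limit):

    import time

    MIN_INTERVAL = 3.0  # seconds between requests; a guess, tune it to the actual limit
    PAGES = 300         # however many pages need fetching

    last_request = 0.0
    for page in range(1, PAGES):
        wait = MIN_INTERVAL - (time.time() - last_request)
        if wait > 0:
            time.sleep(wait)  # only sleep whatever is left of the interval
        last_request = time.time()
        # ... fetch and process one page of reposts here ...

Because the sleep only covers what is left of the interval, the script never waits longer than it has to, so pacing costs little compared with a fixed long sleep.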


If there is a per-IP limit, there are sometimes no great and easy solutions. For example, if the server runs Apache, http://opensource.adnovum.ch/mod_qos/ limits bandwidth and connections; specifically, it limits:

  • The maximum number of concurrent requests.
  • Bandwidth, such as the maximum allowed number of requests per second to a URL or the maximum/minimum of downloaded kbytes per second.
  • The number of request events per second.
  • Generic request line and header filters to deny unauthorized operations.
  • Request body data limitation and filtering.
  • The maximum number of allowed connections from a single IP source address, or dynamic keep-alive control.

You may want to start with these. You could also send a referrer URL with your requests and open only a single connection at a time, rather than multiple concurrent connections.
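
For instance, with plain urllib2 (outside any Weibo SDK; the URL list and the Referer value below are just placeholders) you could attach a referrer and fetch pages strictly one at a time:

    import urllib2

    def fetch(url, referer="http://weibo.com/"):  # placeholder referrer
        request = urllib2.Request(url)
        request.add_header("Referer", referer)    # send a referrer with the request
        response = urllib2.urlopen(request)
        try:
            return response.read()
        finally:
            response.close()

    page_urls = []  # fill with the URLs you need; they are fetched one at a time, never in parallel
    for url in page_urls:
        data = fetch(url)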

You could also refer to this question.

– Ross
  • I have tried making Python sleep 10 and 15 seconds after each request (crawling one page); however, it still runs into the problem, and if I make it sleep for too many seconds it becomes inefficient. – Frank Wang May 14 '12 at 10:15

I figured out the solution:

First, initialize a counter, e.g. 0:

    i = 0

Second, inside the for-page loop, add the following code:

    for page in range(1, 300):
        i += 1
        if i % 25 == 0:
            print i, "found an i that is exactly divisible by 25"
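
The actual pause then goes where the divisibility test succeeds; roughly (a sketch, using the 60 seconds mentioned in the question):

    import time

    i = 0
    for page in range(1, 300):
        i += 1
        if i % 25 == 0:
            print i, "pausing 60 seconds to stay under the IP request limit"
            time.sleep(60)
        # ... crawl page 'page' here ...
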
– Frank Wang