import praw

def get_data_reddit(search):
    username = ""
    password = ""
    r = praw.Reddit(user_agent='')
    r.login(username, password, disable_warning=True)
    posts = r.search(search, subreddit=None, sort=None, syntax=None, period=None, limit=None)
    title = []
    for post in posts:
        title.append(post.title)
    print(len(title))


search="stackoverflow"
get_data_reddit(search)
        

Output = 953

Why the limitation?

  1. The [documentation][1] mentions:

> We can at most get 1000 results from every listing, this is an upstream limitation by reddit. There is nothing we can do to go past this limit. But we may be able to get the results we want with the search() method instead.

Any workaround? I'm hoping there is some way to overcome this in the API; I wrote a scraper for Twitter data and found it to be not the most efficient solution.

Same question: https://github.com/praw-dev/praw/issues/430. Please refer to the aforementioned link for related discussion too.

  [1]: https://praw.readthedocs.org/en/v2.0.15/pages/faq.html

Abhishek Bhatia
  • This is a relatively commonplace practice for APIs to stop people from overloading the servers with requests. You can normally get around it by making your search queries more specific and looping through a defined set, e.g. keep the queries to a specific day, and loop through the last ten days, or whatever reddit will allow that can work in this way. – NDevox Jun 23 '15 at 11:04
  • @Scironic Thanks! This seems a much better solution than a scraper. Can you provide an example to elucidate? It would be greatly helpful. Maybe going through from when reddit started to the current time. – Abhishek Bhatia Jun 23 '15 at 11:07

2 Answers


Limiting results on a search or list is a common tactic for reducing load on servers. The reddit API is clear that this is what it does (as you have already flagged). However, it doesn't stop there...

The API also supports a variation of paged results for listings. Since it is a constantly changing database, they don't provide pages, but instead allow you to pick up where you left off by using the 'after' parameter. This is documented here.

Now, while I'm not familiar with PRAW, I see that the reddit search API conforms to the listing syntax. I think you therefore only need to reissue your search, specifying the extra 'after' parameter (referring to your last result from the first search).
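In outline, that paging pattern works like this. The sketch below is PRAW-independent: `fetch_page` is a hypothetical stand-in for a single search request, and in the real API the `after` value would be the fullname reported for the last item on the page.

```python
def fetch_all(fetch_page):
    """Drain a reddit-style listing by repeatedly passing the last
    item's fullname back as the 'after' parameter."""
    results = []
    after = None
    while True:
        page = fetch_page(after=after)
        if not page:              # an empty page means the listing is exhausted
            break
        results.extend(page)
        after = page[-1]["name"]  # fullname of the last item, e.g. "t3_1c2lkl"
    return results

# A fake one-page-at-a-time source, just to show the loop terminating:
def make_fake_source(items, page_size=2):
    def fetch_page(after=None):
        start = 0
        if after is not None:
            ids = [it["name"] for it in items]
            start = ids.index(after) + 1
        return items[start:start + page_size]
    return fetch_page

items = [{"name": "t3_%d" % i} for i in range(5)]
print(len(fetch_all(make_fake_source(items))))  # 5
```

Swap the fake source for a real request function and the same loop walks the whole listing.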

Having subsequently tried it out, it appears PRAW is genuinely returning you all the results you asked for.

As requested by OP, here's the code I wrote to look at the paged results.

import praw

def get_data_reddit(search, after=None):
    r = praw.Reddit(user_agent='StackOverflow example')
    params = {"q": search}
    if after:
        # Resume the listing from the fullname of the last result seen.
        params["after"] = "t3_" + str(after.id)
    posts = r.get_content(r.config['search'] % 'all', params=params, limit=100)
    return posts

search = "stackoverflow"
post = None
count = 0
while True:
    posts = get_data_reddit(search, post)
    new_results = False
    for post in posts:
        print(str(post.id))
        count += 1
        new_results = True
    if not new_results:  # an empty page means the listing is exhausted
        break
    print(count)
Peter Brittain
  • Thanks for the reply! But can you tell me how to achieve this in Python? – Abhishek Bhatia Jun 27 '15 at 21:57
  • Looking at the PRAW code, you just need to add `after=` to your existing call to search(), where the value is constructed as per http://www.reddit.com/dev/api#fullnames – Peter Brittain Jun 27 '15 at 22:28
  • Thanks for the reply again! Sorry, I am naive about Python. I don't understand exactly what you mean. How can I do what you have advised? Can you please edit it to help me understand better? – Abhishek Bhatia Jun 27 '15 at 22:54
  • Can you please elucidate more? Should the fullname refer to the link of the search URL? – Abhishek Bhatia Jun 28 '15 at 09:17
  • Can you please reply? I am still stuck. – Abhishek Bhatia Jun 28 '15 at 18:44
  • OK, so I had a quick play and found that paging works, if you go to get_content instead. Then I found that search already does the paging for you. So then I repeated the search on reddit and paged to the end... Guess what? There were [953 results](https://www.reddit.com/search?q=stackoverflow&count=950&after=t3_1c2lkl). The short answer is therefore that your search is complete. – Peter Brittain Jun 28 '15 at 23:27
  • Can you provide some code on how you did the paging? I am not sure it paged to the end. Check this https://www.reddit.com/search?q=stackoverflow&restrict_sr=&sort=relevance&t=all&count=1000 – Abhishek Bhatia Jun 28 '15 at 23:32
  • Please provide some code; it helps to cross-check and understand your methodology better. – Abhishek Bhatia Jun 28 '15 at 23:36
  • Given that I've just been voted down twice (without any explanation), I'm not sure that this is worth following, but I'll have one last go... [Your URL](https://www.reddit.com/search?q=stackoverflow&restrict_sr=&sort=relevance&t=all&count=1000) returns exactly the same results as [the simple search](https://www.reddit.com/search?q=stackoverflow&restrict_sr=&sort=relevance&t=all) with just the entries relabeled to start at 1,000. The count is just a mechanism to provide a consistent view to the user and not a definitive index. – Peter Brittain Jun 29 '15 at 00:06
  • I also think that there should be some explanation of why this answer has been voted down. – Visgean Skeloru Jun 29 '15 at 00:17
  • Sorry for downvoting quickly, I just got very confused. Thanks for pointing out my mistake with the count variable. If you check the question, I had asked for all results; I sincerely doubt Reddit has fewer than 1000 posts containing the word `StackOverflow`. I quickly tried a few other popular search entries as well. It seems Reddit always returns fewer than a thousand results. – Abhishek Bhatia Jun 29 '15 at 00:37
  • But this doesn't answer my question completely. It seems using some other search engine like Google could be the only option. What do you suggest? – Abhishek Bhatia Jun 29 '15 at 00:42
  • Looks like the reddit API is deliberately restricted here... I can't see a way to guarantee breaking down any arbitrary search so as to find all posts containing any keyword. A site-restricted search on Google is probably therefore your best bet. – Peter Brittain Jun 29 '15 at 09:11
  • @PeterBrittain Can you give an example please? I can't find it on the web. – Abhishek Bhatia Jun 29 '15 at 09:32
  • You're now asking a very different question, which has been [answered before](http://stackoverflow.com/questions/1657570/google-search-from-a-python-app). I don't think there's anything else to be covered in your original question now and so we should close this trail. – Peter Brittain Jun 29 '15 at 09:49
  • @PeterBrittain It's certainly a different methodology, but the question remains the same. It's my mistake to start with the wrong method at first. – Abhishek Bhatia Jul 03 '15 at 12:08
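For reference, the `after` value discussed in these comments is a reddit "fullname": a type prefix (`t1` for comments, `t3` for links/submissions) joined by an underscore to the thing's base-36 id, per http://www.reddit.com/dev/api#fullnames. A minimal helper might look like this:

```python
# Type prefixes from reddit's fullname scheme (a small assumed subset).
KIND_PREFIXES = {"comment": "t1", "account": "t2", "link": "t3"}

def fullname(kind, thing_id):
    """Build a reddit fullname such as 't3_1c2lkl' for the 'after' parameter."""
    return "%s_%s" % (KIND_PREFIXES[kind], thing_id)

print(fullname("link", "1c2lkl"))  # t3_1c2lkl
```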

So I would simply loop through a predetermined set of search queries. I'm assuming `period` is a time period? I'm also not sure what the format for it would be, so the below is largely made up, but you should get the gist.

In which case it would be something like the following:

import praw

def get_data_reddit(search):
    username = ""
    password = ""
    r = praw.Reddit(user_agent='')
    r.login(username, password, disable_warning=True)
    title = []

    # Declare a set of times to use in the search query to limit results.
    # These are placeholders; the real format depends on what the API accepts.
    periods = (time1, time2, time3, time4)

    for period in periods:  # loop through the different time points
        # Query only the posts from that time; each search now returns
        # a limited result set.
        posts = r.search(search, subreddit=None, sort=None, syntax=None, period=period, limit=None)

        for post in posts:
            title.append(post.title)  # and append as usual
    print(len(title))


search="stackoverflow"
get_data_reddit(search)
NDevox
  • Thanks! I am looking for something like getting the date of the last reddit post related to that query, then checking each month/year from then to now, or some better way to calculate intervals. This seems more like pseudocode than actual code. – Abhishek Bhatia Jun 23 '15 at 11:32
  • For example: one starts from the time reddit started (23 June 2005) and then calculates the number of posts in each year; if greater than 1000, you further divide into months, and so on. It would be a better solution, I suspect. – Abhishek Bhatia Jun 23 '15 at 11:34
  • I reckon reddit will receive more than a thousand posts in an hour, considering it can get around a million comments a day, so it would have to be cut down to quite small timescales. It would take a very large number of requests, and a long time, to calculate the number of posts in a year. – NDevox Jun 23 '15 at 13:36
  • As mentioned in the answer, this isn't meant to be working code; I don't know the reddit API and I don't have a clear understanding of what you want to do. But the best way of getting around data limits is by using a more specific query. How you are going to do that is something you need to figure out, as I can't say. – NDevox Jun 23 '15 at 13:37
  • This seems to suggest something similar https://www.reddit.com/r/redditdev/comments/30a7ap/does_reddit_api_limit_total_listings_returned_to/ – Abhishek Bhatia Jun 23 '15 at 16:56
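The window-splitting idea discussed in these comments can be sketched as follows. `count_posts` is a hypothetical stand-in for a search restricted to one time window; a real implementation would map each window onto whatever period syntax reddit's search accepts.

```python
def split_windows(start, end, count_posts, cap=1000):
    """Recursively halve a [start, end) time window until each piece
    returns fewer results than the listing cap."""
    n = count_posts(start, end)
    if n < cap or end - start <= 1:   # small enough, or cannot split further
        return [(start, end)]
    mid = (start + end) // 2
    return (split_windows(start, mid, count_posts, cap) +
            split_windows(mid, end, count_posts, cap))

# Fake density: 100 posts per time unit, so a 40-unit window must be
# split until each piece holds fewer than 1000 posts.
count_posts = lambda s, e: (e - s) * 100
windows = split_windows(0, 40, count_posts)
print(len(windows))  # 8
```

Searching each resulting window separately and concatenating the results would then, in principle, stay under the per-listing cap, at the cost of extra requests.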