1

I'm building an app to show random questions to users by categories. Each user should vote on question by "yes", "no" or "na". The app counts the votes per questions, each user can vote once per question.

The questions should appear randomly to the users and shouldn't appear more than once per user (user can't proceed without voting).

models.py:

class Category(models.Model):
    name = models.CharField(max_length=500, null=False)
    parent = models.ForeignKey("self", null=True, default=None)

class Question(models.Model):
    question = models.CharField(max_length=500, null=False, blank=False)
    title = models.CharField(max_length=100, null=False, blank=False)
    category = models.ForeignKey(Category, null=True, default=None, blank=True)
    no_count = models.BigIntegerField(default=0)
    yes_count = models.BigIntegerField(default=0)
    na_count = models.BigIntegerField(default=0)
    user = models.ForeignKey(User, null=True, default=None)
    rand = models.FloatField(null=True, default=0)

    def save(self, *args, **kwargs):
        self.rand = random.random()
        super(Picture, self).save(*args, **kwargs)

class Vote(models.Model):
    VOTE_CHOICES = (
        (1, 'Yes'),
        (2, 'No'),
        (3, 'N/A'),
    )
    user = models.ForeignKey(User)
    question = models.ForeignKey(Question, null=True, default=None)
    user_vote = models.IntegerField(choices=VOTE_CHOICES)

class UserSettings(models.Model):
    user = models.OneToOneField(User)
    categories = models.CommaSeparatedIntegerField(max_length=1000, null=True)

views.py:

class GetQuestions(generics.ListAPIView):
    model = Question
    serializer_class = QuestionSerializer

    def get_queryset(self):
        user = self.request.user
        lookup = dict()
        categories = user.usersettings.categories
        if categories is None:
            categories = Category.objects.filter(~Q(parent=None)).values_list('id', flat=True)
        else:
            categories = ast.literal_eval(categories)
        lookup['category__in'] = categories
        voted = Vote.objects.filter(user=self.request.user).values_list('question')
        questions = Question.objects.filter(**lookup).exclude(id__in=voted).order_by('rand')
        return questions

class NewVote(generics.CreateAPIView):
    model = Vote
    serializer_class = VoteSerializer

    def post(self, request, *args, **kwargs):
        current_vote = Vote.objects.filter(user=request.user, picture=int(self.request.DATA['question']))
        if current_vote:
            return HttpResponseForbidden()
        return super(NewVote, self).post(request, *args, **kwargs)

    def pre_save(self, obj):
        obj.user = self.request.user

    def post_save(self, obj, created=False):
        if created:
            vote_count = obj.vote.get_user_vote_display().lower().replace(" ", "")
            vote_count += "_count"
            count = getattr(obj.picture, vote_count)
            setattr(obj.picture, vote_count, count + 1)
            obj.picture.save()

On vote I just increment the relevant count on the question. My questions are:

  1. What is the best way to choose random question? Currently I've added random field on question and use order_by('rand') - is there a better way?
  2. What is the best way to choose question of categories that the user? currently I'm using filter category__in
  3. The most important one - How do I exclude questions that the user already voted on? currently I just select all questions from the vote table which user = request.user and using "NOT IN" - when scaling this is surely won't be good...

Concepts, code sample, links are welcome.

Many thanks

eladelad
  • 99
  • 2
  • 10
  • can the same question appear to more users? – Mihai Zamfir Dec 17 '14 at 13:35
  • Yes, same question can (and should) appear to different users. – eladelad Dec 17 '14 at 13:41
  • Isn't it better if you have a design where ``Vote`` has ``ManyToManyField`` with ``User`` with ``through='Question'``? – Mihai Zamfir Dec 17 '14 at 13:50
  • You mean instead of having user as a foreign key on vote, just make it manytomany with though question and than remove the question field? I can do that of course, but what will I gain from this? I will still have the same problems, no? – eladelad Dec 17 '14 at 13:56
  • You said that q User can have more questions and a same question can appear to more users, so isn;t this the case of a ManyToMany? The design I specified earlier seems more natural in my point of view – Mihai Zamfir Dec 17 '14 at 14:01
  • You need to ask one question in one post (you can create multiple questions) - one post = one question that can be answered. Please not this is not a discussion form, but a question answer site. Asking for "concepts, code sample, links" is a sign you are looking for a discussion. – Burhan Khalid Dec 21 '14 at 07:34

6 Answers6

2

Random things is always a tradeoff with a relation database. There is no definite answer, it all depends on what exactly you need.

Do you actually need randomness? It looks for your case you could just pick the next unanswered question. Unless the user creates multiple accounts, there is no way he would notice. The only issue I would see is if you need answers spread amongst questions.

Is your application going to handle a limited number of open questions at any given time? If so, and you don't expect much traffic, you can stick with the order_by('?'). Anything lower than a couple thousand open questions will be fine.

Next step: drop the sorting The sorting is really expensive. Because it will generate all rows, attach a random value to them, then sort all of them, and only after that you pick the first row. That gets expensive very quickly. Your next option is picking things yourself.

questions = list(Question.objects.filter(whatever_condition))
return random.sample(questions, 10)

You still load every single question in memory. If they have alot of data, or parsing them is a bit complex, that will still be expensive.

Next step: drop the loading First little tradeoff here: we will make two queries.

question_ids = Question.objects.filter(whatever_condition).values_list('id', flat=True)
questions = Question.objects.filter(pk__in=random.sample(question_ids, 10))
return questions

The first query returns all valid ids, the second one picks a random set of ids and load the full instances for those.

Next step: it gets messy Depending on your specific conditions, there are many things you can do to improve it. Some ideas:

  • If you do not delete questions, you could pick ids blindly, by picking random integers between 1 and the total number of questions, excluding those the user has already answered.
  • Same if you delete questions but not much, just retry until you get lucky. For this approach to work, you need to have a relatively low ratio of deleted questions.
  • If you know you will delete lots of questions, but you also have a good idea of the spread of valid questions, you could partition your questions to hit a range with a known average number of valid questions.
  • Or you could store a random integer within known limits in every row. That gives you homogeneous distribution, so you can target a range and know about how many items that will fetch. For instance, your integer is in [1..100000]. If you have 5000 valid questions, and you want to get 10, you pick a random x and filter questions in the [x..x+400] range. You get about 20, you keep 10 (don't forget you might be very unlucky and get less, handle that). Don't forget to wrap at the upper bound.

The last option should perform really fast read as the number can be indexed.

spectras
  • 13,105
  • 2
  • 31
  • 53
1
question = Question.objects.filter(category__in=categories).exclude(vote_set__user=user).order_by('?')[0]

This should work.

filter will ensure that the question you get is from the desired categories

exclude will exclude all questions the user has already voted on

order_by('?') will order the queryset in a random way.

elssar
  • 5,651
  • 7
  • 46
  • 71
  • Thanks, regarding ('?') - I've seen here: https://docs.djangoproject.com/en/dev/ref/models/querysets/ that is it very slow and expensive: "Note: order_by('?') queries may be expensive and slow, depending on the database backend you’re using." will it be better than sort by a field? – eladelad Dec 17 '14 at 14:03
  • @eladelad TBH you can do without any sort at all. Since you're guaranteed to get questions the user hasn't voted on, you can take the first item you get by default. – elssar Dec 17 '14 at 16:51
  • You don't need randomize the whole set. That is expensive. Instead of returning the first element (index '0'), use python random.choice to randomize the instance. – Daniel Dec 18 '14 at 20:00
1
questions = list(Question.objects.filter(category__in=categories)) 

store questions list in session

"What is the best way to choose random question?"

Use python random.choice to randomise questions list.

"How do I exclude questions that the user already voted on"

every time a user answers a question do questions.remove(question) and save it back to the session.

Arpit Singh
  • 3,387
  • 2
  • 17
  • 11
0

select * from question order by rand()

will sort all the rows by random. If the table has many rows, the query will be slow. Here has some approch: How can i optimize MySQL's ORDER BY RAND() function?

Community
  • 1
  • 1
Rong
  • 63
  • 2
  • 4
0
import random
questions = Question.objects.filter(category__in=categories).exclude(vote_set__user=user)
rand_question = random.choice(questions)
Daniel
  • 645
  • 7
  • 11
  • 1
    please add some explanation to your code, explaining what it does and how it solves the problem. This will help others in the future who see your answer – Our Man in Bananas Dec 18 '14 at 20:25
0

Having your users logged is the solely way you can force the "one user one vote" constraint, regardless you can mimic this constrain in other ways using server-side or client-side code but both would be easy to hack.

Regarding performance just do not select all question at once since, hopefully, you're not sending all that info to your user in one big server response but you're using ajax to load the next question.

In a scalable webapp you do not put much hard to compute nor memory expansive code on the replying side of the code, so you surely would not use python to randomize your data but you will use your database to randomize these, said that probably you would have to check the random performance of your database backend first using django orm and then doing a raw query.

The django query should look like this:

Question.objects.exclude(replies__user=current_user).order_by('?')[:10] 

The query is, indeed, expensive because you are filtering on a MtM field and then doing a random ordering, but for a start is good enough, if you find it yet too slow you can do something like this

num = some integer 
rando = some random bigger or equal to num, and smaller than the Question table length 
Question.objects.all()[rando-num: rando].exclude(replies__user=current_user).order_by('?') 

So that you have a double random a fast one that tell your database that you need only a (random but your database does not know that ;] ) subset of the Question table and then doing all that expensive work only on that, mostly you will not want your subset (including joins etc.) to exceed your L2 memory in size while the entire table could easily exceed that (the random thing could prove harder if you have to pass info between ram and cache).

The trade of is that you have no assurance that there is any result, so that an user that has still questions available could not be able to see any of them in a given request, but you could do a, somewhat expensive, extra query if this is the case,.