0

Suppose I have a large (300-500k) collection of text documents stored in the relational database. Each document can belong to one or more (up to six) categories. I need users to be able to randomly select documents in a specific category so that a single entity is never repeated, much like how StumbleUpon works.

I don't really see a way I could implement this using slow NOT IN queries with large amount of users and documents, so I figured I might need to implement some custom data structure for this purpose. Perhaps there is already a paper describing some algorithm that might be adapted to my needs?

Currently I'm considering the following approach:

Read all the entries from the database Create a linked list based index for each category from the IDs of documents belonging to the this category. Shuffle it Create a Bloom Filter containing all of the entries viewed by a particular user Traverse the index using the iterator, randomly select items using Bloom Filter to pick not viewed items.

Plic Pl
  • 491
  • 3
  • 7
  • 16
  • 2
    If there's that many documents and that few categories, why do you need to guarantee that it never repeats? Why can't you just return them randomly and rely on the statisticly low chance that there will be a repeat? – RBarryYoung May 13 '13 at 16:30
  • Which, by the way, is probably how SU works - I've seen plenty of repeats on there – im so confused May 13 '13 at 16:34

2 Answers2

0

I would recommend a hash table implementation. This will ensure that you get constant time look ups. You can implement a technique known as linear probing. A linked list is an awful implementation, because you eat O(N) on search time.

If the constraint is that it has to be a relational database, you can use things such as memcache (FB uses this), to keep what is essentially the friend of a friend problem.

Woot4Moo
  • 23,987
  • 16
  • 94
  • 151
0

See this answer for how to return entries from a database in random order. Now you can simply track where in the random sequence a user has been shown up to (for each category) and Skip() and Take() to get the next group of records to show them. You can store the random XOR value per user so each one sees a different sequence.

Community
  • 1
  • 1
Ian Mercer
  • 38,490
  • 8
  • 97
  • 133