5

I have this table,

person_id   int(10) pk
points      int(6) index
other columns not very important

I have this random function which is very fast on a table with 10M rows:

SELECT person_id
  FROM persons AS r1 JOIN
       (SELECT (RAND() *
                     (SELECT MAX(person_id)
                        FROM persons)) AS id)
        AS r2
 WHERE r1.person_id >= r2.id
 ORDER BY r1.person_id ASC
 LIMIT 1

This is all great but now I wish to show only people with points > 0. Example table:

PERSON_ID      POINTS
1              4
2              6
3              0
4              3

When I append AND points > 0 to the where clause, person_id 3 can't be selected, so a gap is created and when the random select person_id 3, person_id 4 will be selected. This gives person 4 a bigger chance to be chosen. Any one got suggestions how I can adjust the query to make it work with the where clause and give all rows same % of chance to be selected.

Info table: The table is uniform, no gaps in person_id's. About 90% will have 0 points. I want to make the query for where points = 0 and points > 0.

Before someone will say, use rand(): this is not solution for tables with more than a few 100k rows.

Bonus question: will it be possible to select x random rows in 1 query, so I do not have to call this query a few times when i want more random rows?

Important note: performance is key, with 10M+ rows the query may not take much longer than the current query, which takes 0.0005 seconds, I prefer to stay under 0.05 second.

Last note: If you think the query will never be this fast with above requirements, but another solution is possible (like fetching 100 rows and showing x random which has more than 0 points), please tell :)

Really appreciate your help and all help is welcome :)

Kevin Vermaat
  • 311
  • 2
  • 4
  • 18
  • What is the original task? Is PK an artificial (sequence based) integer? How sparse is it? – zerkms May 22 '13 at 10:50
  • Table with 10M rows, need 4 random rows with at least 1 point, the person_id is uniform (1,2,3,4 etc, no gaps), but some might have 0 points, so they create gaps.. – Kevin Vermaat May 22 '13 at 10:52
  • So there is no primary key at all? – zerkms May 22 '13 at 10:52
  • yes there is, person_id = pk, points can be 0, person_id not – Kevin Vermaat May 22 '13 at 10:53
  • just a side note (irrelevant to this particular question but still a useful advice): for innodb it's required to have primary key that grows monotonously – zerkms May 22 '13 at 10:55
  • 1
    Are you again in need of pure-sql solution? Because this is an example of perfect task for simply utilizing a cache which you will update every now and then (and then you won't care that it takes 2 minutes to run). It may not be a best practice, but it's practical. – Tymoteusz Paul May 22 '13 at 10:55
  • Sorry Puciek, I am working on a school project and databases is not my strongest skill, however I want to learn and after reading the book on school i just have a few more questions :/ – Kevin Vermaat May 22 '13 at 10:57
  • i've always used `order by newid()` to randomize output in t-sql. isn't there something similar in mysql? – rocky May 22 '13 at 11:00
  • That is fine kevin, and you can notify people of your reply by putting "@" in front of their name (as we don't get notification on comments that are for you). – Tymoteusz Paul May 22 '13 at 11:05
  • You _do_ know that if `person_id` isn't continuous / gapless, your stated bias for certain records above other records is already happening, even without your `WHERE` clause? – Wrikken May 22 '13 at 11:08
  • 1
    @Wrikken yes i know that, the person_id is gapless – Kevin Vermaat May 22 '13 at 11:10
  • What percentage of rows match the `WHERE`? – Martin Smith May 22 '13 at 11:14
  • I'd go into simple application mode: `SELECT COUNT(*) FROM tablename WHERE ....`, `SELECT * FROM tablename WHERE ... LIMIT FLOOR(resultofthatcount*RAND()),1`. However, ordering might take a while.. – Wrikken May 22 '13 at 11:17
  • Is this an InnoDB or MyISAM table? InnoDB will make better use of your points index. – Steven Moseley May 22 '13 at 11:28
  • @StevenMoseley It's MyISAM since it's on shared hosting and innodb needs some tuning which i cant do since the shared webhost ;) – Kevin Vermaat May 22 '13 at 11:58

1 Answers1

1

You could generate in-line gap-free id's for the records that you really want to work with, and generate then the random selector using the total number of records available.

Try with this (props to the chosen answer here for the row_number generator):

    SELECT r1.*
    FROM
        (SELECT  person_id,
                 @curRow := @curRow + 1 AS row_number
        FROM persons as p,
             (SELECT @curRow := 0) r0
        WHERE points>0) r1
    , (SELECT COUNT(1) * RAND() id
       FROM persons
       WHERE points>0) r2
    WHERE r1.person_id>=r2.id
    ORDER BY r1.person_id ASC
    LIMIT 1;

You can mess with it in this sqlfiddle.

Community
  • 1
  • 1
listik
  • 229
  • 1
  • 5