
I have a database of more than 8 million rows from which I want to randomly select n rows. First of all, I've read the popular similar question here on StackOverflow and the article on MSDN, but I feel that the answers still don't suit my needs.

The offered solutions work great if I want a certain percentage of rows selected randomly, without extra conditions. What I want is to select n rows randomly (e.g. at most 5 rows), all matching a certain condition.

My database contains words with information such as their part of speech, tag, lemma and token. Now I want to perform a query that selects 5 random words, all similar to the word in the query (e.g. give me 5 words similar to fuzzy). Similarity is determined by looking only at words with the same part of speech and with a Levenshtein distance above a certain threshold. I have a function in SQL Server that can calculate the Levenshtein distance.

The problem with the aforementioned methods is that they either have to run over all the records and calculate the Levenshtein distance (which takes a lot of time!), or they only let me select a percentage instead of n rows.

A query that works sufficiently well is:

SELECT DISTINCT TOP 5 lower(Words.TOKEN) as LTOKEN, Words.LEMMA, TagSet.POS_Simplified, TagSet.TAG 
FROM Words JOIN TagSet on Words.TAG = TagSet.TAG 
WHERE NOT Words.LEMMA = 'monarchie' AND TagSet.POS_Simplified = 'noun' 
AND TagSet.TAG = 'NOM' AND NOT Words.TOKEN = 'monarchie'
AND [dbo].edit_distance('monarchie', Words.Token) > 0.5

However, with TOP alone I always get the same results; I need the TOP rows to be random. Methods like ORDER BY NEWID() first go over the entire table and then select randomly, which is not the behavior I intend, as they take far too long.
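To be explicit, this is the NEWID() pattern I mean, applied to the query above (DISTINCT is dropped here because SQL Server does not allow ORDER BY NEWID() together with SELECT DISTINCT). It returns random rows, but only after evaluating and sorting every matching row:

SELECT TOP 5 lower(Words.TOKEN) as LTOKEN, Words.LEMMA, TagSet.POS_Simplified, TagSet.TAG 
FROM Words JOIN TagSet on Words.TAG = TagSet.TAG 
WHERE NOT Words.LEMMA = 'monarchie' AND TagSet.POS_Simplified = 'noun' 
AND TagSet.TAG = 'NOM' AND NOT Words.TOKEN = 'monarchie'
AND [dbo].edit_distance('monarchie', Words.Token) > 0.5
ORDER BY NEWID()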

Does anyone have an idea to select n random rows fast on a huge database?


EDIT:

Someone (not on StackOverflow) offered me a solution using the OPTION clause with the FAST query hint, which retrieves the first n rows it finds.

With OPTION (FAST 5) I get the best performance so far (10 seconds on the 8-million-plus-row table). I also changed the Levenshtein function from a T-SQL implementation to a C# CLR library implementation, which sped up the performance considerably.

SELECT TOP 5 * FROM (
SELECT DISTINCT lower(Words.TOKEN) as LTOKEN, Words.LEMMA, TagSet.POS_Simplified, TagSet.TAG 
FROM Words JOIN TagSet on Words.TAG = TagSet.TAG 
WHERE NOT Words.LEMMA = 'monarchie' AND TagSet.POS_Simplified = 'noun' 
AND TagSet.TAG = 'NOM' AND NOT Words.TOKEN = 'monarchie'
AND [dbo].clrEditDistance('monarchie', Words.Token) > 0.5
) AS T
ORDER BY NEWID()
OPTION (FAST 5)
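For anyone wondering how the CLR function is wired up: it is registered from a .NET assembly roughly as below. The assembly name, path, class and method names are placeholders for illustration, not the actual library I used.

-- Sketch only: names and path below are placeholders.
CREATE ASSEMBLY EditDistanceLib
FROM 'C:\libs\EditDistanceLib.dll'
WITH PERMISSION_SET = SAFE;
GO

CREATE FUNCTION dbo.clrEditDistance(@a nvarchar(4000), @b nvarchar(4000))
RETURNS float
AS EXTERNAL NAME EditDistanceLib.[EditDistance.Functions].Similarity;
GO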
  • Add ORDER BY NEWID() to the end of your query; the sort operation executes last. Or try this query: SELECT * FROM (...) AS T ORDER BY NEWID() – realnumber3012 Nov 04 '13 at 09:25
  • apart from the random-ness aspect of your question, isn't it better to use the built-in fulltext capabilities (CONTAINS, etc.) of SQLServer? – davek Nov 04 '13 at 09:26
  • @realnumber3012 Your first suggestion will still go over all my records and takes around 5 minutes to complete (basically the same performance as running the query for every record instead of a select few). Your second suggestion will only randomize the 5 resulting rows from my query, which is not what I want. I want 5 random rows from the table, not a random order of the top 5 rows. – Floris Devriendt Nov 04 '13 at 09:41
  • @davek To be honest, I didn't realize the full capabilities of the CONTAINS function of SQLServer. I'm not sure if it will match my needs, but I'll definitely look into it. – Floris Devriendt Nov 04 '13 at 09:42
  • TOP N ORDER BY NEWID() doesn't just randomize the order of the top N; it implicitly assigns a NEWID to each row, and since NEWID is well-distributed this gives you a random (well, pseudorandom) set of N rows out of your subquery. I worry about this "OPTION(fast 5)" from a randomness perspective; it seems to me that allowing SQL Server to pick a convenient ordering obviates the pseudorandom ordering that ORDER BY NEWID() gives you. – Robert Calhoun Jan 16 '14 at 20:15
  • Actually since NEWIDs are unique, it gives you "pseudorandom without replacement". See [random vs shuffle discussion](http://stackoverflow.com/questions/5772108/select-n-random-records-in-sql-server-without-repetition) – Robert Calhoun Jan 16 '14 at 20:33

4 Answers


Avoiding a full scan is going to be difficult. If you had a column that you could sample randomly with ease, say a "dense" identity column with few gaps, you could replace Klark's approach with the following modification:

declare @results table (id bigint, name varchar(100))

while (select count(*) from @results) < 5
    begin
    insert  @results
            (name)
    select  name
    from    (
            select  *
            from    dbo.Words
            -- probe a single random ID; assumes IDCOLUMN is dense
            -- between 1 and APPX_NUMBER_OF_ROWS
            where   IDCOLUMN = CONVERT(INT, RAND() * APPX_NUMBER_OF_ROWS)
            ) as SubQueryAlias
    where   dbo.edit_distance(left(name,4), 'APRS', 100) < 3
    end

select  *
from    @results

In order to get random data you need to go through all the rows that match your WHERE clause. The seek will be done only on the rows that match your WHERE expression, so it won't be a full table scan. If you have a lot of records that match your search, you can do something like:

select top 5 * from
(
SELECT DISTINCT TOP 1000 lower(Words.TOKEN) as LTOKEN, Words.LEMMA, TagSet.POS_Simplified, TagSet.TAG 
FROM Words JOIN TagSet on Words.TAG = TagSet.TAG 
WHERE NOT Words.LEMMA = 'monarchie' AND TagSet.POS_Simplified = 'noun' 
AND TagSet.TAG = 'NOM' AND NOT Words.TOKEN = 'monarchie'
AND [dbo].edit_distance('monarchie', Words.Token) > 0.5
) as T order by newid();

But of course, this will not be truly random.

Klark

From your question I assume you know that many rows will match your edit_distance > 0.5 condition. But SQL Server does not know that. A way to share that information with SQL Server is to write a more explicit query using table variables.

declare @results table (id bigint, name varchar(100))

while (select count(*) from @results) < 5
    begin
    insert  @results
            (name)
    select  name
    from    (
            select  top 100 *
            from    dbo.Words
            order by
                    newid()
            ) as SubQueryAlias          
    where   dbo.edit_distance(left(name,4), 'APRS', 100) < 3
    end  

select  top 5 *
from    @results

The snippet above selects 100 random rows at a time, and inserts those that match into the result table. It loops until it has found 5 rows. At the end, it selects 5 rows from the result table.

This should be more efficient if you have many matching rows, but far less efficient if there are few.

Andomar
  • I have to admit I'm not experienced in SQL Server and I'm not that familiar with the way you came up with your solution. Why does the second select statement also select name? – Floris Devriendt Nov 04 '13 at 14:38
  • The second select query provides new rows to be inserted in `@results`. It has to match the insert list. In my example, the insert list is `(name)`, so the query just selects `name`. – Andomar Nov 04 '13 at 14:40

I think there is a fundamental limit on how fast you can do what you're looking for. If you want to pick records from the table quickly, you need a way to use an index. Say you have a sequential integer column ID that is the clustered index: you could select records with distinct random ID values that you generate, but you have no guarantee that every ID between MIN(ID) and MAX(ID) corresponds to a row in the table, so you may end up with fewer rows than you asked for.

One thing you might do is take the query with the conditions you want applied and add a ROW_NUMBER() (see this technet article), giving you a sequential "key" with no "holes" in it, and then select random integers within the bounds of that key. You will then be dealing with a subset of the table rather than the whole table, but I suspect that may be the best you can do performance-wise.
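A rough sketch of that idea, reusing the tables from the question (the temp table name and the sample-with-replacement shortcut are just for illustration):

-- Number the rows that match the fixed filters; #Numbered is a scratch name.
SELECT ROW_NUMBER() OVER (ORDER BY Words.TOKEN) AS RN,
       lower(Words.TOKEN) AS LTOKEN, Words.LEMMA
INTO   #Numbered
FROM   Words JOIN TagSet ON Words.TAG = TagSet.TAG
WHERE  TagSet.POS_Simplified = 'noun' AND TagSet.TAG = 'NOM';

-- Probe random positions in the gap-free key. This samples with
-- replacement, so duplicates are possible; loop or dedupe if you
-- must have exactly 5 distinct rows.
DECLARE @max int = (SELECT COUNT(*) FROM #Numbered);

SELECT n.*
FROM   #Numbered AS n
JOIN   (SELECT TOP 5 1 + CAST(RAND(CHECKSUM(NEWID())) * @max AS int) AS RN
        FROM sys.objects) AS r
       ON r.RN = n.RN;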

You could deal with the "holes" in ID by writing a table-valued function that uses a loop (just keep selecting random values until you get the number of results you want), but that is inelegant and could have concurrency problems depending on the access patterns on your database. So it depends on your requirements in those regards.