2

Here is a simulation of the basic setup i have: each person can hold multiple possessions.
Persons table:

id  name
1   Carl
2   Sam
3   Tom
4   Jack

Possessions table:

possession personId
car        2
shoes      2
shovel     2
tent       3
matches    3
axe        4

I want to generate a random set of possessions belonging to a random set of people, one possession per person.

So, in a non-SQL world I would generate a set of N random people and then pick a random possession for each person in the set. But how do I implement that in SQL semantics?

I thought of getting a random sample of possessions with some variation of:

SELECT * FROM Posessions WHERE 0.01 >= RAND()

And then filtering out duplicate persons, but that is no good as it will favor persons with large number of possessions in the end, and I want each person to have equal chance of being selected.

Is there a canonical way to solve this?

P.S. Person contains ~50000 entities and Possession contains ~2500000 entities, but i only need to perform this sampling once, so it can be somewhat slow.

Neil Lunn
  • 148,042
  • 36
  • 346
  • 317

3 Answers3

0

Why don't you take random set of persons and join to posessions ranked by random. Something like below. Sorry if it contain any spelling error but I don't have DB to check it now:

    select * from (
    (select top 1 percent * from persons order by newid()) a
    inner join
    (select p.*, ROW_NUMBER() OVER (partition by personId order by newid()) r from posessions p) b
    on (a.personId = b.personId) 
)
    where r = 1;
Kacper
  • 4,798
  • 2
  • 19
  • 34
0

One way would be (for 2 persons below and one possession per person)

DECLARE @PeopleCount INT = 2, 
        @PossessionsPerPersonCount INT = 1;

SELECT *
FROM (SELECT TOP (@PeopleCount) *
FROM Persons
ORDER BY CRYPT_GEN_RANDOM(4)) RandomPersons 
OUTER APPLY (SELECT TOP (@PossessionsPerPersonCount) * FROM Posessions p
             WHERE RandomPersons.id = p.personId
             ORDER BY CRYPT_GEN_RANDOM(4)) RandomPosessions

Hopefully Possession has an index on personId so that it can seek into the relevant rows per person (average 50) rather than scanning all 2,500,000 in the table for each person.

I've used OUTER APPLY above as not all the people in your example data have possessions (i.e. Carl doesn't).

If you only want to include people with possessions and want one possession per person you can use this instead.

DECLARE @PeopleCount INT = 2;

SELECT TOP (@PeopleCount) *
FROM Persons
CROSS APPLY (SELECT TOP (1) * FROM Posessions p
             WHERE Persons.id = p.personId
             ORDER BY CRYPT_GEN_RANDOM(4)) RandomPosessions
ORDER BY CRYPT_GEN_RANDOM(4);
Martin Smith
  • 438,706
  • 87
  • 741
  • 845
  • Martin Please do you think you could take a look at this question : http://stackoverflow.com/questions/40197865/mssql-is-there-any-performance-penalty-for-wrapping-a-select-query-in-a-transa – eddy Oct 22 '16 at 23:37
0

The following query will generate 3 random sample for you


    SELECT p.id,
    (SELECT posession FROM posessions p1 where p1.id=p.id ORDER BY RAND() LIMIT 1) as posession
    FROM posessions p
    GROUP BY p.id
    ORDER BY RAND()
    LIMIT 3

The sub-query generate random posession of each person, while the outer-query generate random person.

ArtCHNK
  • 46
  • 5