
I have two tables.

Table1 has two columns: brand and review_counter.

Table2 also has two columns: brand and review.

There are several reviews for each brand. Is there any way in SQL to randomly select around 10% of the reviews for each brand, without using a "top n" command?

For example, for 'Sony' there are 2,005 reviews, and I need to select 10% of them, roughly 200 reviews.

Thank you in advance.

– Emely_sh

2 Answers


A possible method using MySQL user variables as counters. Compared to the other answer here, this is likely to bring back a more accurate 10% of reviews per brand, but it is also likely to be slower (neither solution will be fast, as both rely on using RAND() on every row of the tables).

This gets all the rows, ordered by brand and then by RAND(). It uses that as a sub query and adds a sequence number, resetting back to 1 at the first record for each brand. That in turn is the source for an outer query which keeps only the records whose generated sequence number is less than or equal to a tenth of the review count for that brand.

SELECT sub1.brand, sub1.review
FROM
(
    -- Number the rows, restarting the counter at 1 each time the brand changes
    SELECT sub0.brand, sub0.reviews_wanted, sub0.review,
           @cnt := IF(@brand = sub0.brand, @cnt + 1, 1) AS cnt,
           @brand := sub0.brand AS cur_brand
    FROM
    (
        -- Every review joined to its brand's review count, in random order within each brand
        SELECT Table1.brand, (Table1.review_counter * 0.1) AS reviews_wanted, Table2.review
        FROM Table1
        INNER JOIN Table2
        ON Table1.brand = Table2.brand
        ORDER BY Table1.brand, RAND()
    ) sub0
    -- Initialise the user variables before the first row is processed
    CROSS JOIN (SELECT @cnt := 0, @brand := '') sub2
) sub1
WHERE sub1.cnt <= sub1.reviews_wanted

EDIT.

This might be a bit more memory efficient (although probably slower).

This has a sub query that gets the unique id of every review for a brand, in a random order, along with a count that is one tenth of the number of reviews for the brand. It then uses that count with SUBSTRING_INDEX to keep just the ids of the first random 10%, and joins that back to the reviews table using FIND_IN_SET.

SELECT sub0.brand, Table2.review
FROM
(
    -- One row per brand: the number of reviews to keep, plus every review id in a random order
    -- (the concatenated id list is subject to the group_concat_max_len limit)
    SELECT Table1.brand,
           CEIL(Table1.review_counter * 0.1) AS reviews_wanted,
           GROUP_CONCAT(Table2.id ORDER BY RAND()) AS id
    FROM Table1
    INNER JOIN Table2
    ON Table1.brand = Table2.brand
    GROUP BY Table1.brand, reviews_wanted
) sub0
INNER JOIN Table2
-- SUBSTRING_INDEX keeps only the first reviews_wanted ids; FIND_IN_SET is 0 (false) for any other id
ON FIND_IN_SET(Table2.id, SUBSTRING_INDEX(sub0.id, ',', sub0.reviews_wanted))
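
As a quick illustration of what those two functions are doing here (the id values below are made up): SUBSTRING_INDEX keeps the first reviews_wanted ids from the comma separated list, and FIND_IN_SET returns a non-zero position only for ids that are in that truncated list.

-- Illustration only, with made-up id values
SELECT SUBSTRING_INDEX('17,4,9,23', ',', 2);   -- '17,4'  (the first 2 ids of the random order)
SELECT FIND_IN_SET('9', '17,4');               -- 0, treated as FALSE, so that review is dropped
SELECT FIND_IN_SET('4', '17,4');               -- 2, non-zero, so that review is kept by the join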

You might be able to do something a bit more efficient using one of the solutions here:

How can i optimize MySQL's ORDER BY RAND() function?
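
On MySQL 8.0 or later, a window function can do the same per-brand numbering without user variables. A rough sketch along those lines, assuming the same Table1/Table2 layout as above (the alias "numbered" is just illustrative):

-- Sketch only (needs MySQL 8.0+): number each brand's reviews in random order,
-- then keep the rows whose number falls within the first 10%
SELECT numbered.brand, numbered.review
FROM
(
    SELECT Table2.brand, Table2.review,
           ROW_NUMBER() OVER (PARTITION BY Table2.brand ORDER BY RAND()) AS rn,
           Table1.review_counter * 0.1 AS reviews_wanted
    FROM Table1
    INNER JOIN Table2 ON Table1.brand = Table2.brand
) numbered
WHERE numbered.rn <= numbered.reviews_wanted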

– Kickstart
  • Could you please elaborate more on your solution? what are the sub1, sub0, and reviews_wanted? – Emely_sh Jun 25 '14 at 12:52
  • These are just alias names. A sub query being used as the source for a query (ie, as a pseudo table) must have an alias name. In this query sub0 and sub1 are just alias names for sub queries. The name reviews_wanted is just an alias name for a calculated value, in this case it contains 1/10th of Table1.review_counter for a brand (or it would if the dot wasn't missing - just fixing that now!). – Kickstart Jun 25 '14 at 13:01
  • Thanks for your explanation, after running your query, I've got an error "Error in Process Request - Error Code: 500 - Error text: Internal Server Error" in phpmyadmin. I'm trying to solve the problem to see the query result. – Emely_sh Jun 25 '14 at 13:21
  • That sounds like a server error rather than an SQL error. Might be easiest to export the table declares and a few rows of data and set up an sqlfiddle. – Kickstart Jun 25 '14 at 13:25
  • Added a suggestion that might be a bit more memory efficient. – Kickstart Jun 27 '14 at 09:42
  • I got this error: Unknown column 'sub0.id' in 'on clause' – Emely_sh Jun 28 '14 at 01:50
  • Added a fix (missed out putting the alias) – Kickstart Jun 30 '14 at 08:09
  • Thank you so much. The result is close to my expectation. – Emely_sh Jul 01 '14 at 13:13

RAND() generates random values between 0 and 1. Why don't you try this?

UPDATED2

SELECT review
FROM
(
    -- (review_counter * RAND()) / review_counter reduces to RAND(),
    -- so each review row is kept with roughly a 10% probability
    SELECT review, (review_counter * RAND()) / review_counter AS rand
    FROM Table1 INNER JOIN Table2 ON Table1.brand = Table2.brand
) t
WHERE rand < 0.1
– Jason Heo
  • When I run it in my DB in phpmyadmin, I've got an error "Error in Process Request - Error Code: 500 - Error text: Internal Server Error". I've just got this error when I run your updated2 and Kickstart's query. I've uploaded a small portion of my DB on (http://www.sqlfiddle.com/#!2/61801). I didn't get the error but the query didn't work well. – Emely_sh Jun 26 '14 at 01:31
  • @Emely_sh As Kickstart said, an 'internal server error' is related to the server. Is there any error in the apache error log? And with only a small portion of the data it might not work well, because with so few rows the random numbers are not distributed evenly. – Jason Heo Jun 26 '14 at 01:39
  • I tried other queries but I did not get the error! I couldn't upload the whole DB because it's too big, and when I tried to upload more I got "Your schema ddl is too large (more than 8000 characters). Please submit a smaller DDL." – Emely_sh Jun 26 '14 at 01:48
  • This is the error in apache error log: "PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 59919581 bytes) in /usr/share/phpmyadmin/libraries/DisplayResults.class.php on line 2646, referer: http://localhost/phpmyadmin/import.php". – Emely_sh Jun 26 '14 at 02:06
  • @Emely_sh PHP needs more than 128MB of memory to print the output of the SQL; increase memory_limit to more than 128MB in php.ini if you are the sysadmin, or call ini_set(). http://bestphphosting.net/how-to-increase-php-memory-limit/ – Jason Heo Jun 26 '14 at 02:13
  • I've changed memory_limit to 384M in php.ini but the problem still exists. – Emely_sh Jun 26 '14 at 02:21
  • @Emely_sh PHP has inefficient memory management for its Array type. It needs more memory than you would expect coming from C or Java, so you may need to increase the limit further if you keep getting the same error. – Jason Heo Jun 26 '14 at 02:28
  • @Emely_sh and, check how many records are returned by query like this: SELECT COUNT(*) FROM ( SELECT review, (review_counter * RAND()) / review_counter AS rand FROM Table1 INNER JOIN Table2 ON Table1.brand = Table2.brand) t WHERE rand < 0.1 – Jason Heo Jun 26 '14 at 02:28
  • @Emely_sh Then give some LIMITs to check SQL produces right results: SELECT review FROM ( SELECT review, (review_counter * RAND()) / review_counter AS rand FROM Table1 INNER JOIN Table2 ON Table1.brand = Table2.brand) t WHERE rand < 0.1 LIMIT 1000; – Jason Heo Jun 26 '14 at 02:29
  • @Emely_sh it seems that `review` column has long string value. Could you try this? SELECT review FROM ( SELECT SUBSTR(review, 1, 100) AS review, (review_counter * RAND()) / review_counter AS rand FROM Table1 INNER JOIN Table2 ON Table1.brand = Table2.brand) t WHERE rand < 0.1 – Jason Heo Jun 26 '14 at 02:35
  • Same error using SELECT SUBSTR(review, 1, 100) although I've increased the memory_limit to 2000M! – Emely_sh Jun 26 '14 at 02:48
  • How long are these reviews? Does Table2 have a unique id field? – Kickstart Jun 26 '14 at 09:17
  • On average, each review has 250 words. Yes, it has. – Emely_sh Jun 27 '14 at 00:22
  • I've fixed the error by storing the result of the sub query in a new table and then running the main query on that table (see the sketch after this thread). But the results are not exactly what I want, because it doesn't return exactly 10% of the reviews for EACH brand, and there are no reviews at all for some brands in the result. – Emely_sh Jun 27 '14 at 07:00
  • Think the problem with both this and my solutions might be the amount of data that the sub query ends up storing (although I would expect MySQL to cope with it; it might be that you have some odd settings). 250 words per review probably means at least 1.5k bytes per review, hence just for Sony about 3MB of reviews to be stored in a temp table. – Kickstart Jun 27 '14 at 09:31
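
A rough sketch of the workaround described in the comment above (materialising the sub query into its own table first, then filtering that table); the table name sampled_rand is made up:

-- Sketch of the workaround from the comments; the table name sampled_rand is made up
CREATE TABLE sampled_rand AS
SELECT Table2.brand, Table2.review,
       (Table1.review_counter * RAND()) / Table1.review_counter AS rand_val
FROM Table1
INNER JOIN Table2 ON Table1.brand = Table2.brand;

-- Then run the filter against the materialised table
SELECT brand, review
FROM sampled_rand
WHERE rand_val < 0.1;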