-1

Is it possible to quickly select random rows from a table, while also using a where condition?

Example:

SELECT * FROM geo WHERE placeRef = 1 ORDER BY RAND() LIMIT 1

This can take 10+ seconds.

I found this, which is sometimes quick, sometimes very slow:

(SELECT *
FROM geo
INNER JOIN ( SELECT RAND() * ( SELECT MAX( nameRef ) FROM geo ) AS ID ) AS t ON geo.nameRef >= t.ID
WHERE geo.placeRef = 1
ORDER BY geo.nameRef
LIMIT 1)

This provides a quick result, only if there is no extra where condition.

This is the create table:

CREATE TABLE `geo` (
 `nameRef` int(8) DEFAULT NULL,
 `placeRef` mediumint(7) unsigned DEFAULT NULL,
 `category` enum('continent','country','region','subregion') COLLATE utf8_bin DEFAULT NULL,
 `parentRef` mediumint(7) DEFAULT NULL,
 `incidence` int(9) unsigned NOT NULL,
 `percent` decimal(11,9) unsigned DEFAULT NULL,
 `ratio` int(11) NOT NULL,
 `rank` mediumint(7) unsigned DEFAULT NULL,
 KEY `placeRef_rank` (`placeRef`,`rank`),
 KEY `nameRef_category` (`nameRef`,`category`),
 KEY `nameRef_parentRef` (`nameRef`,`parentRef`),
 KEY `nameRef_placeRef` (`nameRef`,`placeRef`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin

N.B. this table has around 550 million rows.

Desired query: query the table where placeRef = x; and then quickly return one row.

Issue: a query like SELECT * FROM geo WHERE placeRef = 1 can provide up to about 15 million results. So selecting a single random row is slow.

Kohjah Breese
  • 4,008
  • 6
  • 32
  • 48

2 Answers2

1

That technique is variable because it depends on where the matching rows happen to lie in the table.

The quick fix may be to add this index, assuming that nameRef is the PRIMARY KEY for the table:

INDEX(placeRef, nameRef)

Let's discuss this further after

There are (currently) 3 indexes that make this subquery very fast (because of the leading nameRef):

( SELECT MAX( nameRef ) FROM geo )

After that, my suggestion of (placeRef, nameRef) will kick in for these:

WHERE geo.placeRef = 1
geo.nameRef >= t.ID

I think the resulting query should be consistently fast.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Added above. The `nameRef` is not relevant. I am seeking to get a random nameRef. There is no primary key for the table, as there is no need to identify rows on their own. placeRef, nameRef is a unique index. – Kohjah Breese Apr 08 '21 at 15:03
  • 1
    @KohjahBreese - Every table needs a `PRIMARY KEY`. If the pair `(placeRef, nameRef)` is unique and not null, make it the PK. That will possibly speed up your query. Note that I deliberately put `placeRef` first. – Rick James Apr 08 '21 at 15:14
  • @KohjahBreese - However, due to the likely uneven distribution of the nameRef values, your query will produce a not-so-random result. – Rick James Apr 08 '21 at 15:16
0

This is pulling a result in 1/100th of a second:

SELECT * FROM geo where placeRef = 1 AND nameRef >= CEIL( RAND() * ( SELECT MAX( nameRef ) FROM forenameGeo ) ) LIMIT 1

This works well if you have an index on both the columns you would like to query. However, you may need to make a new table that is randomly ordered. In my table the nameRefs tend to be grouped by country. This causes the random results to be selected from a handful of results as most of the resulted are grouped around the same Id. I needed to create a new table ordered randomly ORDER BY RAND() where each row had a unique Id. Now I search this much smaller summary table with:

SELECT * FROM geoSummary where placeRef = 1 AND nameRef >= CEIL( RAND() * ( SELECT MAX( id ) FROM geoSummary ) ) LIMIT 1

Though to cut that SELECT MAX query running all the time I have saved the maximum Id in the server-side code, generate the random number there and run:

SELECT * FROM geoSummary where placeRef = 1 AND nameRef >= :random_number LIMIT 1

This provides truly random results.

Kohjah Breese
  • 4,008
  • 6
  • 32
  • 48