0

I have a like query that’s processing millions of rows:

SELECT
    sample_id,
    REPLACE( sample_id, '*', '') AS term
FROM
    sample.table
WHERE
    sample_id LIKE '%*%'
ORDER BY
    sample_id ASC;

I tried batching the queries but its still too slow to process. Have someone experienced this in the past and successfully solved this? I’m basically open to any ideas at this point. Thanks!

Rei
  • 313
  • 2
  • 11
  • This is unrelated to PHP. Add result of `explain` query. Tag RDBMS (mysql?) – user3783243 Nov 17 '22 at 01:22
  • 5
    You can't really speed-up `LIKE` prefix-and-suffix queries because it's not possible to index strings for that (at least, notwithout building your own suffix-tree, which is non-trivial and I don't believe any RDBMS implements that in-box). You can speed up `LIKE` prefix queries as that's trivially indexable, and `LIKE`-suffix queries by indexing a `REVERSE()` representation of the text. – Dai Nov 17 '22 at 01:24
  • Use Elastic Search to do the search instead of using `like` on RDBMS. – tim Nov 17 '22 at 01:28
  • If you tag a question with `sql`, then you must tag the RDBMS too. – Jeff Holt Nov 17 '22 at 01:30
  • @tim Integrating an entire textual search-engine system is overkill for a problem like this (not to mention introducing problems inherent in having to keep the search engine's index in-sync with table data). I suspect they just need to destructure their `sample_id` column such that it can be indexed appropriately. – Dai Nov 17 '22 at 01:55
  • It seems that this `LIKE` isn't actually helping you at all. `REPLACE` internally must already be doing something like that `LIKE '%*%'` operation in order to actually apply the replacement. You're doing some of the same work twice! This should be a pretty fast operation on "millions of rows" in Postgres on modern hardware. Unless this is bottlenecking your application, I would suggest just using `REPLACE` without the `WHERE ... LIKE` and not worrying further about its performance. If performance *is* a problem, then perhaps you need something more intelligent than `REPLACE`. – shadowtalker Nov 17 '22 at 02:37
  • 3
    Install the pg_trgm extension and create an index. This can also support a LIKE search. https://www.postgresql.org/docs/15/pgtrgm.html#id-1.11.7.44.8 – Frank Heikens Nov 17 '22 at 03:44
  • @Dai As Frank Heikens just mentioned, the `pg_trgm` extension helps. Also see https://stackoverflow.com/a/13452528/479863 which discusses the small pattern issue. – mu is too short Nov 17 '22 at 03:45

1 Answers1

-1

You did not mention which RDBMS you are using, but you can speed up processing by using properly designed index.

Index properties (basing on Microsoft SQL Server RDBMS):

  1. filtered index:

    you can implement a filtered index. Filter corresponds to the WHERE clause from your query. You can add "sample_id LIKE '%*%'" as a filter condition.

  2. covering index:

    your query is not complicated, so it should be easy to create a covering index for it. By covering index I mean a structure which will contain all the columns which are mentioned in your query, it will help the RDBMS engine to decide to use it during execution becaue it will contain all the needed columns, and the filter also, as mentioned in the first point.

So the syntax could look like this (Microsoft SQL Server pseudo code):

CREATE INDEX idx1 ON your_table_name (sample_id) WHERE sample_id LIKE '%*%'

If you would build it, you would have a DEDICATED structure for your query. You can think of it as of a subset of the data from your table, but physicaly present in your database, written to disk and being constantly updated as the data changes. As long as this index has the filter, it contains only the rows needed by your query. So you can imagine that if the RDBMS engine would choose it - by parsing and analyzing your code - the WHERE clause would not have to execute.

Unfortunatelly, I am not aware if other RDBMSes than Microsoft SQL Server deliver filtered indexes.

If your RDBMS doesn't allow for filtered indexes you can at least create a covering one. Still it might be lighter structure than your table, however, you didn't present the structure of your table.

An index doesn't come without a cost but this is a further story. Just remember that it takes place on disk and is being updated along with the data in your table.

Bartosz Siemasz
  • 296
  • 1
  • 10
  • Ok, so if it is postgres then here you have the syntax for the index creation -> https://www.postgresql.org/docs/current/sql-createindex.html , it looks like it has everything I've mentioned – Bartosz Siemasz Nov 17 '22 at 02:53
  • 2
    This won't work effectively: you can't index for `LIKE '%*%'` - this isn't an RDBMS limitation, it's a universal fact of life: all this filtered-index will do is replace a table-scan with an index-scan, which can be just as slow, especially if `STATISTICS` says the index [has a low selectivity](https://www.erikdarlingdata.com/predicate-selectivity-and-index-design/). – Dai Nov 17 '22 at 03:14
  • I really think that it needs checking if you can't add the filter in that form. At least in Microsoft SQL Server I don't know about such limitation. – Bartosz Siemasz Nov 17 '22 at 10:46
  • You can't use `LIKE` in a filtered index's predicate in SQL Server: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/create-filtered-indexes?view=sql-server-ver16 – Dai Nov 17 '22 at 10:50
  • Indeed, you are right, LIKE is not allowed. Maybe he can think of some kind of a substitute – Bartosz Siemasz Nov 17 '22 at 14:47
  • I imagine you could use a computed column and index that - though I’d use CHARINDEX or PATINDEX instead of LIKE as it’s more useful. – Dai Nov 17 '22 at 14:50