Improving query speed: simple SELECT with LIKE

Question

I have inherited a large legacy codebase which runs in django 1.5 and my current task is to speed up a section of the site which takes ~1min to load.

I did a profile of the app and got this:

The culprit in particular is the following query (stripped for brevity):

SELECT COUNT(*) FROM "entities_entity" WHERE (
  "entities_entity"."date_filed" <= '2016-01-21' AND (
    UPPER("entities_entity"."entity_city_state_zip"::text) LIKE UPPER('%Atherton%') OR
    UPPER("entities_entity"."entity_city_state_zip"::text) LIKE UPPER('%Berkeley%') OR
    -- 34 more of these
    UPPER("entities_entity"."agent_city_state_zip"::text) LIKE UPPER('%Atherton%') OR
    UPPER("entities_entity"."agent_city_state_zip"::text) LIKE UPPER('%Berkeley%') OR
    -- 34 more of these
  )
)

which basically consist on a big like query on two fields, entity_city_state_zip and agent_city_state_zip which are character varying(200) | not null fields.

That query is performed twice (!), taking 18814.02ms each time, and one more time replacing the COUNT for a SELECT taking up an extra 20216.49 (I'm going to cache the result of the COUNT)

The explain looks like this:

Aggregate  (cost=175867.33..175867.34 rows=1 width=0) (actual time=17841.502..17841.502 rows=1 loops=1)
  ->  Seq Scan on entities_entity  (cost=0.00..175858.95 rows=3351 width=0) (actual time=0.849..17818.551 rows=145075 loops=1)
        Filter: ((date_filed <= '2016-01-21'::date) AND ((upper((entity_city_state_zip)::text) ~~ '%ATHERTON%'::text) OR (upper((entity_city_state_zip)::text) ~~ '%BERKELEY%'::text) (..skipped..) OR (upper((agent_city_state_zip)::text) ~~ '%ATHERTON%'::text) OR (upper((agent_city_state_zip)::text) ~~ '%BERKELEY%'::text) OR (upper((agent_city_state_zip)::text) ~~ '%BURLINGAME%'::text) ))
        Rows Removed by Filter: 310249
Planning time: 2.110 ms
Execution time: 17841.944 ms

I've tried using an index on entity_city_state_zip and agent_city_state_zip using various combinations like:

CREATE INDEX ON entities_entity (upper(entity_city_state_zip));
CREATE INDEX ON entities_entity (upper(agent_city_state_zip));

or using varchar_pattern_ops, with no luck.

The server is using something like this:

qs = queryset.filter(Q(entity_city_state_zip__icontains = all_city_list) |
                     Q(agent_city_state_zip__icontains = all_city_list))

to generate that query.

I don't know what else to try,

Thanks!

`LIKE` queries, which starts with `'%...'` won't use any btree index (including `xxx_pattern_ops`). These indexes only picked, if the pattern matches at the start. (f.ex. `col LIKE 'XXX%'` or `col ~ '^XXX'`). You can try the [`pg_trgm` module](http://www.postgresql.org/docs/current/static/pgtrgm.html), [which provides a suitable index for you](http://dba.stackexchange.com/questions/10694/pattern-matching-with-like-similar-to-or-regular-expressions-in-postgresql/10696). (and you can use `ilike` instead of `like` & `lower()`/`upper()` calls). — pozs, Jan 21 '16 at 15:56
I'd at least want to know what effect the `Seq Scan` was having, and whether an index scan could be substituted. See what effect `set enable_seqscan=false` has on the plan. Is DB running off a SSD? — Andrew Regan, Jan 21 '16 at 16:08
@AndrewRegan yes, the test was made on my mac which has an SSD (in production). Setting `enable_seqscan` to false yields: `Aggregate (cost=175867.33..175867.34 rows=1 width=0) (actual time=20916.498..20916.498 rows=1 loops=1) -> Seq Scan on entities_entity (cost=0.00..175858.95 rows=3351 width=0) (actual time=0.192..20871.984 rows=145075 loops=1)` — nicosantangelo, Jan 21 '16 at 18:26
Ah OK, that had no effect, which suggests the planner had no alternative index it could possibly use. I guess that makes my next suggestion - lowering `random_page_cost` to 1.1 or so, to tell the planner that it's fast enough to random-access a potential index on your fast disk rather then seq scan - redundant. — Andrew Regan, Jan 21 '16 at 19:00

score 1 · Accepted Answer · edited May 23 '17 at 11:59

1

I think problem in "multiple LIKE" and in UPPER("entities_entity ...

You can use:

WHERE entities_entity.entity_city_state_zip SIMILAR TO '%Atherton%|%Berkeley%'

Or something like this:

WHERE entities_entity.entity_city_state_zip LIKE ANY(ARRAY['%Atherton%', '%Berkeley%'])

Edited

About Raw SQL query in Django:

Regards

edited May 23 '17 at 11:59

Community

1
1

answered Jan 21 '16 at 15:41

Volodymyr Matvienko

79
5

I didn't know `LIKE` supported `ANY` with an array as a value. The problem I have is making `django` create that query I'll google a bit to see what I can find – nicosantangelo Jan 21 '16 at 16:00
It's postgres man) supported) What about django... I think "raw query" it's what you want https://docs.djangoproject.com/es/1.9/topics/db/sql/ and read this link too http://stackoverflow.com/questions/31698103/how-do-i-execute-raw-sql-in-a-django-migration – Volodymyr Matvienko Jan 21 '16 at 18:00

Hambone · Answer 2 · 2016-01-22T04:12:58.703

I watched a course in Pluralsight that addressed a very similar issue. The course was "Postgres for .NET Developers" and this was in the section "Fun With Simple SQL", "Full Text Search."

To summarize their solution, using your example:

Create a new column in your table that will represent your entity_city_state_zip as a tsvector:

create table entities_entity (
  date_filed date,
  entity_city_state_zip text,
  csz_search tsvector not null   -- add this column
);

Initially you might have to make it nullable, then populate the data and make it non-nullable.

update entities_entity
set csz_search = to_tsvector (entity_city_state_zip);

Next, create a trigger that will cause the new field to be populated any time a record is added or modified:

create trigger entities_insert_update
before insert or update on entities_entity
for each row execute procedure
tsvector_update_trigger(csz_search,'pg_catalog.english',entity_city_state_zip);

Your search queries can now query on the tsvector field rather than the city/state/zip field:

select * from entities_entity
where csz_search @@ to_tsquery('Atherton')

Some notes of interest on this:

to_tsquery, in case you haven't used it is WAY more sophisticated than the example above. It allows and conditions, partial matches, etc
it is also case-insensitive, so there is no need to do the upper functions you have in your query

As a final step, put a GIN index on the tsquery field:

create index entities_entity_ix1 on entities_entity
using gin(csz_search);

If I understand the course right, this should make your query fly, and it will overcome the issue of a btree index's inability to work on a like '% query.

Here is the explain plan on such a query:

Bitmap Heap Scan on entities_entity  (cost=56.16..1204.78 rows=505 width=81)
  Recheck Cond: (csz_search @@ to_tsquery('Atherton'::text))
  ->  Bitmap Index Scan on entities_entity_ix1  (cost=0.00..56.04 rows=505 width=0)
        Index Cond: (csz_search @@ to_tsquery('Atherton'::text))

It's pretty awesome. I did some quick tests on around 2,000,000 rows of data and this method took around 300 ms vs around 2.4 seconds for a traditional query. With the nested "or" queries on larger datasets, I'll bet the differences will be much more dramatic. — Hambone, Jan 22 '16 at 16:17

Improving query speed: simple SELECT with LIKE

2 Answers2