4

I'm using a varchar[] column (varchar array) in Postgres 9.2 to store some tags. While retrieving rows by tags, I want the query to be case insensitive. However, I want to preserve the case to display in the UI (therefore I can't just store everything as lower case).

So, my question is how do I create a case-insensitive index in Postgres over a varchar array? One possible approach would be to create a functional GIN index on the column. How does one do that? Any other approaches?

Saurabh Nanda
  • 6,373
  • 5
  • 31
  • 60
  • 2
    Bad idea to use an array to store the tags. – Clodoaldo Neto Apr 15 '13 at 15:10
  • @ClodoaldoNeto, why do you say that? – Saurabh Nanda Apr 15 '13 at 15:22
  • Search for relational model – Clodoaldo Neto Apr 15 '13 at 15:28
  • 2
    @ClodoaldoNeto, though this may not be perfectly normalized, it can be a perfectly valid production design pattern. I've often had situations where denormalizing into arrays makes sense in terms of both space and speed. – marcj Apr 15 '13 at 16:01
  • @marcj denormalized is not the same as not normalized – Clodoaldo Neto Apr 15 '13 at 16:04
  • @ClodoaldoNeto: thanks for your comment. I don't usually like to do this without careful thought and planning. Having said that, this is good case for using an array and I've had good results doing same. – marcj Apr 15 '13 at 16:09
  • @marcj: One might think to save space with an array. However, a Postgres array has [24 bytes of overhead](http://stackoverflow.com/questions/2966524/calculating-and-saving-space-in-postgresql/7431468#7431468) - plus space for actual values. For a couple of simple tags - usually `integer` values of 4 bytes - it is generally more efficient to use separate columns, nullable if you want to save disk space. `NULL` occupies one bit in the NULL bitmask. An integer array only starts to save disk space with more than ~ 30 columns (depends on null columns). Separate rows need more space, of course. – Erwin Brandstetter Apr 15 '13 at 19:02
  • @ErwinBrandstetter: Good points. If there is a limited number of tags and they are relatively fixed it would of course be much more efficient to use columns. I am thinking of tags as being dynamic - this is case I've often run into - and continually adding columns would become unmanageable (and rows inefficient). There are tradeoffs here and best answer really will depend on the data and use. – marcj Apr 16 '13 at 22:05
  • 1
    @ErwinBrandstetter: Though question was about a varchar array, an interesting alternative I've been using lately has been to store dynamic data in JSON type (using json_accessors extension/plV8) and GIN indices. This has it's own trade-offs but a quick test shows storage requirements are indeed lower. This is another path that poster might explore. – marcj Apr 16 '13 at 22:27
  • @marcj: Of course there are cases. I was only addressing the disk space requirement for arrays. Sometimes it's also a matter of convenience .. – Erwin Brandstetter Apr 16 '13 at 22:27
  • @ErwinBrandstetter: Again, great points and perspective - thanks! – marcj Apr 16 '13 at 22:30

3 Answers3

6

@Saurabh Nanda: Similar to what you posted, you can also create a simple function to convert your varchar array to lowercase as follows:

CREATE OR REPLACE FUNCTION array_lowercase(varchar[]) RETURNS varchar[] AS
$BODY$
  SELECT array_agg(q.tag) FROM (
    SELECT btrim(lower(unnest($1)))::varchar AS tag
  ) AS q;
$BODY$
  language sql IMMUTABLE;

Note that I'm also trimming the tags of spaces. This might not be necessary for you but I usually do for consistency.

Testing:

SELECT array_lowercase(array['Hello','WOrLD']);
 array_lowercase 
-----------------
 {hello,world}
(1 row)

As noted by Saurabh, you can then create a GIN index:

CREATE INDEX ix_tags ON tagtable USING GIN(array_lowercase(tags));

And query:

SELECT * FROM tagtable WHERE ARRAY['mytag'::varchar] && array_lowercase(tags);

UPDATE: Performance of WHILE vs array_agg/unnest

I created table of 100K 10 element text[] arrays (12 character random mixed case strings) and tested each function.

The array_agg/unnest function returned:

EXPLAIN ANALYZE VERBOSE SELECT array_lowercase(data) FROM test;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test  (cost=0.00..28703.00 rows=100000 width=184) (actual time=0.320..3041.292 rows=100000 loops=1)
   Output: array_lowercase((data)::character varying[])
 Total runtime: 3174.690 ms
(3 rows)

The WHILE function returned:

EXPLAIN ANALYZE VERBOSE SELECT array_lowercase_while(data) FROM test;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test  (cost=0.00..28703.00 rows=100000 width=184) (actual time=5.128..4356.647 rows=100000 loops=1)
   Output: array_lowercase_while((data)::character varying[])
 Total runtime: 4485.226 ms
(3 rows)

UPDATE 2: FOREACH vs. WHILE As a final experiment, I changed the WHILE function to use FOREACH:

CREATE OR REPLACE FUNCTION array_lowercase_foreach(p_input varchar[]) RETURNS varchar[] AS $BODY$
DECLARE
    el text;
    r varchar[];
BEGIN
    FOREACH el IN ARRAY p_input LOOP
        r := r || btrim(lower(el))::varchar;
    END LOOP;
    RETURN r;
END;
$BODY$
  language 'plpgsql'

Results appeared to be similar to WHILE:

EXPLAIN ANALYZE VERBOSE SELECT array_lowercase_foreach(data) FROM test;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test  (cost=0.00..28703.00 rows=100000 width=184) (actual time=0.707..4106.867 rows=100000 loops=1)
   Output: array_lowercase_foreach((data)::character varying[])
 Total runtime: 4239.958 ms
(3 rows)

Though my tests are not by any means rigorous, I did run each version a number of times and found the numbers to be representative, suggesting that the SQL method (array_agg/unnest) is the fastest.

marcj
  • 336
  • 1
  • 5
  • Any thoughts on efficiency of a SQL query vs a WHILE loop in converting an array to lowercase? – Saurabh Nanda Apr 17 '13 at 06:26
  • @SaurabhNanda: I tested both methods and updated my answer. The SQL method was consistently faster. I also tried a version using the `FOREACH` statement ( [see here](http://www.postgresql.org/docs/9.2/interactive/plpgsql-control-structures.html#PLPGSQL-FOREACH-ARRAY) ): I'd need to test with a larger sample but results were similar to `WHILE`. – marcj Apr 17 '13 at 13:29
  • thank you for taking the time to run some benchmarks on this. Any idea why the `SELECT` query is faster than `FOREACH` & `WHILE`? – Saurabh Nanda Apr 18 '13 at 07:18
1

Found one possible approach using a custom pgplsql function:

First declare a custom function that takes a varchar[] array as input and returns a new array with all elements converted to lowercase. (This is the first time I'm writing PL/SQL, so this might be very inefficient code).

CREATE OR REPLACE FUNCTION array_lowercase(varchar[]) RETURNS varchar[] AS $$
DECLARE
    i INTEGER;
    l INTEGER;
    r VARCHAR[];
    inp ALIAS FOR $1;
BEGIN
    i := 1;
    l := array_length($1, 1);
    WHILE i <= l LOOP
        r[i] = lower(inp[i]);
        i := i + 1;
    END LOOP;
    RETURN r;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

Next, create a GIN index on the expression using the newly defined array_lowercase function:

create index hotel_bookings_tags on hotel_bookings using gin(array_lowercase(tags));

Now use it in a query (verify that it's using the index using EXPLAIN):

select * from posts where array[(varchar 'some_tag')] && array_lowercase(tags);
Saurabh Nanda
  • 6,373
  • 5
  • 31
  • 60
0

not sure it helps, but I was searching something similar for text[], and used type casting:

select id from product where lower(tags::text)::text[] && array['tat'];
Florian Klein
  • 8,692
  • 1
  • 32
  • 42