2

Does PostgreSQL have some way for me to find if an array is contained within another array, but with the same ordering?
For example, I want to know if array1 is within array2 with matching elements in the same order.

array1[1, 3, 6, 8]
array2[3, 8, 2, 9, 10, 1, 6]

Obviously not the case in the example, but is there a built-in method for this in PostgreSQL or should I create my own function?

Version of PostgreSQL is 9.6. The actual numbers the query will run on are bigints.

Philip
  • 95
  • 1
  • 9
  • 1
    if you compare arrays in order, why not compare `array2::text like concat('%',array1::text,'%')` ?.. – Vao Tsun Mar 01 '17 at 13:36
  • 1
    @VaoTsun, If `array2[1,4,6,10,12]` and `array1[1,6,12]` then `array2::text like concat('%',array1::text,'%')` would be false, but it should be true according to OP's requirements. However `array2::text like concat('%',array_to_string(array1, '%'),'%')` may work (I haven't tested). – JNevill Mar 01 '17 at 13:42
  • yes - I did not think about possible gaps in order - true :) – Vao Tsun Mar 01 '17 at 13:43
  • Ah, didn't even think about using the like operator, but that's clever. The second solution is spot on. Thanks a bunch! – Philip Mar 01 '17 at 13:47
  • 2
    @Philip: the `like` solution will not work for e.g. `array[1,2]` and `array[11,2]` –  Mar 01 '17 at 14:38
  • For integer arrays or for other types (too)? Are duplicate elements possible? Do you use this test for queries on (big) tables? And please always declare your version of Postgres. – Erwin Brandstetter Mar 01 '17 at 15:34
  • @ErwinBrandstetter It's on bigint arrays only. Duplicate elements are possible as the integers represent an intersection (road network) and they could possible drive in circles. The queries will have to sift through at least 250,000 rows, possible more. PostgreSQL version 9.6. – Philip Mar 02 '17 at 08:22

1 Answers1

2

General case

All elements of the second array are in the first, too. In the same order, but gaps are allowed.

I suggest this polymorphic PL/pgSQL function:

CREATE OR REPLACE FUNCTION array_contains_array_in_order(arr1 ANYARRAY
                                                       , arr2 ANYARRAY
                                                       , elem ANYELEMENT = NULL)
  RETURNS bool AS
$func$
DECLARE
   pos int := 1;
BEGIN
   FOREACH elem in ARRAY arr2
   LOOP
      pos := pos + array_position(arr1[pos:], elem);  -- see below
      IF pos IS NULL THEN
         RETURN FALSE;
      END IF;
   END LOOP; 

   RETURN true; --  all elements found in order
END
$func$ LANGUAGE plpgsql IMMUTABLE COST 3000;

As @a_horse commented, we can omit the upper bound in array subscripts to mean "unbounded" (arr1[pos:]). In older versions before 9.6 substitute with arr1[pos:2147483647] - 2147483647 = 2^31 - 1 being the theoretical max array index, the greatest signed int4 number.

This works for ...

About the ANYELEMENT trick:

Performance

I ran a quick performance test comparing this function to the one @a_horse supplied. This one was around 5x faster.

If you use this a filter for a big table I strongly suggest you combine it with a (logically redundant) sargable filter like:

SELECT *
FROM   tbl
WHERE  arr @> '{2,160,134,58,149,111}'::int[] 
AND    array_contains_array_in_order(arr, '{2,160,134,58,149,111}')  

This will use a GIN index on the array column like:

CREATE INDEX ON tbl USING gin (arr);

And only filter remaining (typically very few!) arrays that share all elements. Typically much faster.

Caveats with intarray module

Note: applies to integer[] exclusively, not smallint[] or bigint[] or any other array type!

Careful, if you have installed the intarray extension, which provides its own variant of the @> operator for int[]. Either you create an (additional) GIN index with its special operator class (which is a bit faster where applicable):

CREATE INDEX ON intarr USING gin (arr gin__int_ops);

Or, while you only have a GIN index with the default operator class, you must explicitly denote the standard operator to cooperate with the index:

WHERE  arr OPERATOR(pg_catalog.@>) '{2,160,134,58,149,111}'::int[] 

Details:

Simple case

As commented, your case is simpler:

The complete second array is included in the first (same order, no gaps!).

CREATE OR REPLACE FUNCTION array_contains_array_exactly(arr1 ANYARRAY, arr2 ANYARRAY)
  RETURNS bool AS
$func$
DECLARE
   len int := array_length(arr2, 1) - 1;  -- length of arr2 - 1 to fix off-by-1
   pos int;                               -- for current search postition in arr1
BEGIN
   /*                                     -- OPTIONAL, if invalid input possible
   CASE array_length(arr1, 1) >  len      -- array_length(arr2, 1) - 1 
   WHEN TRUE  THEN                        -- valid arrays
      -- do nothing, proceed
   WHEN FALSE THEN                        -- arr1 shorter than arr2
      RETURN FALSE;                       -- or raise exception?
   ELSE                                   -- at least one array empty or NULL
      RETURN NULL;
   END CASE;
   */

   pos := array_position(arr1, arr2[1]);  -- pos of arr2's 1st elem in arr1

   WHILE pos IS NOT NULL
   LOOP
      IF arr1[pos:pos+len] = arr2 THEN    -- array slice matches arr2 *exactly*
         RETURN TRUE;                     -- arr2 is part of arr1
      END IF;

      pos := pos + array_position(arr1[(pos+1):], arr2[1]);
   END LOOP; 

   RETURN FALSE;
END
$func$ LANGUAGE plpgsql IMMUTABLE COST 1000;

Considerably faster than the above for longer arrays. All other considerations still apply.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 1
    In Postgres 9.6 `arr1[pos:2147483647]` can be replaced with `arr1[pos:]` –  Mar 01 '17 at 21:12
  • @a_horse_with_no_name: Yes, that's better. I integrated your improvement above, thanks. – Erwin Brandstetter Mar 01 '17 at 23:28
  • This runs really fast even without the index you mention. I think I need to test some more, but this seems really promising. – Philip Mar 02 '17 at 08:39
  • 1
    @Philip: You commented on the other (now deleted answer): `Ah, my question should've been more clear. I want to have the exact same sequence. So {3,1,6} and {3,8,2,9,10,1,6} would be false, however {3,1,6} and {3,8,2,9,10,3,1,6} would be true.` This solution is for the more general case. *Your* case is simpler. Please edit your question to include ***all*** relevant information. – Erwin Brandstetter Mar 02 '17 at 15:28
  • 1
    I added a solution for your simpler problem. – Erwin Brandstetter Mar 02 '17 at 16:23