1

I have two tables wherein one of the columns contain the data which is comma separated. I need to compare both the columns. The comma separated fields can be in any order, but the order is irrelevant.

create table x_a (id1 integer, shared text);
create table x_b (id1 integer, shared text);

insert into x_a values 
  (1, 'A,B,C,D,E')
, (2, 'A,B,C,D,E');

insert into x_b values 
  (1, 'B,A,C,E,D')
, (2, 'B,A,C,E');

I used below query, but it is not returning any output:

select a.id1,b.id1, a.shared, b.shared
from x_a a ,x_b b
where a.id1 = b.id1 
and regexp_split_to_array(LOWER(a.shared),',')
  = regexp_split_to_array(LOWER(b.shared),',')

I cannot use the operator && as it will return id=2 which is wrong as the "Shared" column is not the exact copy.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • because `regexp_split_to_array(LOWER(a.shared),',') != regexp_split_to_array(LOWER(b.shared),',')`, since `abcde` != `baced` and `abcde` != `bace` , respectively. – Barbaros Özhan Mar 19 '19 at 10:13
  • 1
    Change your database/table design. Storing comma separated values violates the fundamental rules of relational databases. – Kaushik Nayak Mar 19 '19 at 10:38

2 Answers2

3

I cannot use the operator && as it will return id=2 which is wrong ...

But you can use the array operators @> and <@ like this:

SELECT id1, a.shared AS a_shared, b.shared AS b_shared
FROM   x_a a
JOIN   x_b b USING (id1)
WHERE  string_to_array(a.shared, ',') @> string_to_array(b.shared, ',')
AND    string_to_array(a.shared, ',') <@ string_to_array(b.shared, ',');

If A contains B, and B contains A, both are equal - ignoring duplicates.

You might want to store (sorted) arrays to begin with - or normalize your DB design with a 1:n relationship.

If your elements are integer numbers, consider the additional intarray module for superior performance. See:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
0

The = for array depends on the order of the element, so you need to sort the elements when you turn them into an array. I think the easiest way is to do that with your own function.

create function string_to_array_sorted(p_input text)
  returns text[]
as
$$
  select array_agg(t.x order by t.x)
  from unnest(string_to_array(lower(p_input), ',')) as t(x);
$$
language sql
immutable;

Note that string_to_array() is much more efficient than regexp_split_to_array() if you don't need a regex.

With the above function your query can be written as:

select a.id1,b.id1, a.shared, b.shared
from x_a a 
  join x_b b on a.id1 = b.id1 
where string_to_array_sorted(a.shared) = string_to_array_sorted(b.shared);

Note that I replaced the ancient, outdated and fragile implicit join with a "modern" (30 years old) explicit JOIN operator.

You can create an index to speed this up:

create index on x_a (id1, string_to_array_sorted(shared));
create index on x_b (id1, string_to_array_sorted(shared));
  • Thank you, it worked. But I do have Crores of data in the table, so I would have preferred avoiding a function call – Preeti Kaur Mar 19 '19 at 10:19
  • You already **have** a function call (`regexp_split_to_array`) - but you have to sort the elements somehow. Whether you do that in a function or "inline" doesn't really matter. The overhead of a `language sql` (especially when declared `immutable`) is next to none compared to "inlining" the same functionality (where you would need an additional `group by`) statement. The function gives you the ability to create an index on those values. If you really have millions of rows, this might actually be faster than anything you "inline" in your statement –  Mar 19 '19 at 10:23
  • This function is returning the output in Curly braces. basically it is converting the string to an array. For example string " A,B,D,C,E' is getting converted to array and we are getting the output as " {a,b,c,d,e} – Preeti Kaur May 30 '19 at 08:59
  • @PreetiKaur: yes, it's returning an array - the same that `regexp_split_to_array` returns –  May 30 '19 at 09:42