tricky SQL with substrings

Question

I have a table (postgres) with a varchar field that has content structured like:

".. John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9 .."

The uuid can occur in more than one record. But it must not occur for more than one combination of [givenname];[surname], according to a business rule.

That is, if the John Smith example above is present in the table, then if uuid 7c32e9e1.. occurs in any other record, the field in that record most also contain ".. John;Smith; .."

The problem is, this business rule has been violated due to some bug. And I would like to know how many rows in the table contains a uuid such that it occurs in more than one place with different combinations of [givenname];[surname].

I'd appreciate if someone could help me out with the SQL to accomplish this.

Is the length of the uuid fixed (that is, can you search on the last x number of characters of the string to find the uuid) or is it variable length? Obviously the entire string is variable length because of the names, but if the uuid is fixed, then you can look for the substring at the right and see if it appears more than once when you `GROUP BY` the entire field. — daShier, Oct 21 '19 at 13:41
*a varchar field that has content structured like...* - it's just horrible. You should start by healing the table structure and splitting different parts of the field into new columns. Without this anything you want to do with the table will be tricky forever. — klin, Oct 21 '19 at 13:45
Is this a duplicate of https://stackoverflow.com/questions/2594829/finding-duplicate-values-in-a-sql-table ? — Bill Karwin, Oct 21 '19 at 13:54
Is the order of the (semicolon separated) elements always the same? (Or to put it the other way: if there is another string `Smith` in that line - how do you know which one is the name and which one is something else? — , Oct 21 '19 at 14:09
Postgres supports UUID as a native type. It can be either 32 or 36 characters depending on formatted or not. It always contains 32 hex digits and optionally formatted as 8-4-4-4-12 digits. So a regex for a valid is uuid is '[a-fA-F0-9]{8}(-?)([a-fA-F0-9]{4}\1){3}[a-fA-F0-9]{12}' It will accept upper case hex digits but converts them to lower case. — Belayer, Oct 22 '19 at 00:49

Thorsten Kettner · Accepted Answer · 2019-10-21T14:23:03.260

Use regular expressions to extract the UUID and the name from the string. Then aggregate per UUID and either count distinct names or compare minimum and maximum name:

select
  substring(col, 'uuid=([[:alnum:]]+)') as uuid,
  string_agg(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid'), ' | ') as names
from mytable
group by substring(col, 'uuid=([[:alnum:]]+)')
having count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) > 1;

Demo: https://dbfiddle.uk/?rdbms=postgres_12&fiddle=907a283a754eb7427d4ffbf50c6f0028

If you only want to count:

select
 count(*) as cnt_uuids,
 sum(num_names) as cnt_names,
 sum(num_rows) as cnt_rows
from
(
 select
 count(*) as num_rows,
 count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) as num_names
 from mytable
 group by substring(col, 'uuid=([[:alnum:]]+)')
 having count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) > 1
) flaws;

But as has been mentioned already: This is not how a database should be used.

score 1 · Answer 2 · answered Oct 21 '19 at 14:17

I assume you know all the reasons why this is a bad data format, but you are stuck with it. Here is my approach:

select v.user_id, array_agg(distinct names)
from (select v.id,
             max(el) filter (where n = un) as user_id,
             array_agg(el order by el) filter (where n in (un - 2, un - 1)) as names
      from (select v.id, u.*,
                   max(u.n) filter (where el like 'uuid=%') over (partition by v.id) as un
            from (values (1 , 'junkgoeshere;John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..'),
                         (2 , 'junkgoeshere;John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..'),
                         (3 , 'junkgoeshere;John;Smith;uuid=new_7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..'),
                         (4 , 'junkgoeshere;John;Jay;uuid=new_7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..')
                 ) v(id, str) cross join lateral
                 unnest(regexp_split_to_array(v.str, ';')) with ordinality u(el, n)
           ) v
      where n between un - 2 and un
      group by v.id
     ) v
group by user_id
having min(names) <> max(names);

Here is a db<>fiddle.

This assumes that the fields are separated by semicolons. Your data format is just awful, not just as a string but because the names are not identified. So, I am assuming they are the two fields before the user_id field.

So, this implements the following logic:

Breaks up the string by semicolons, with an identifying number.
Finds the number for the user_id.
Extracts the previous two fields together and the user_id column.
Then uses aggregation to find cases where there are multiple matches.

tricky SQL with substrings

2 Answers2