0

Lets say i have a table with the following data.

+-------------------------------+
|   UniqueID  Name    Dataset   |
+-------------------------------+
| 1         ABC1    A:B;C:D;E:F |
| 2         ABC2    A:B;C:D;R:S |
| 3         ABC3    C:4;G:5;A:B |
| 4         ABC4    A:B;C:D;E:F |
+-------------------------------+

where dataset is combinations of data e.g. A:B, delimited by ;

Effectively what i want to do do is compare each set of data against the other record and build a picture up like the following by comparing 'Dataset' to get result like this (this is just the first comparison)

+--------------------------------------------------------------------------+
| UniqueID  Name    UniqueID   Name    Matched on  OnlyinBase OnlyinTarget |
+--------------------------------------------------------------------------+
| 1         ABC1    2          ABC2    A:B;C:D       E:F           R:S     |
| etc                                                                      |
+--------------------------------------------------------------------------+

What is the best way to do the above?

KP65
  • 13,315
  • 13
  • 45
  • 46

2 Answers2

1

Alternative solution in one query:

with 
  -- sample data
  t(Id, Name, Dataset) as (
    select 1, 'ABC1', 'A:B;C:D;E:F' from dual union all
    select 2, 'ABC2', 'A:B;C:D;R:S' from dual union all
    select 3, 'ABC3', 'C:4;G:5;A:B' from dual union all
    select 4, 'ABC4', 'A:B;C:D;E:F' from dual ),
  -- end of sample data
  q as (
    select distinct id, name,
           trim(regexp_substr(t.dataset, '[^;]+', 1, ls.column_value)) as ds
      from t, table(cast(multiset(select level from dual 
                                  connect by level <= length(regexp_replace(t.dataset, '[^;]+'))+1) 
                         as sys.odcinumberlist)) ls),
  p as (select q1.id id1, q1.name name1, q2.id id2, q2.name name2, q1.ds set1, q2.ds set2, 
               max(case when q1.ds = q2.ds then 1 else 0 end) 
                   over (partition by q1.id, q2.id, q1.ds) m1, 
               max(case when q1.ds = q2.ds then 1 else 0 end) 
                   over (partition by q1.id, q2.id, q2.ds) m2
          from q q1 join q q2 on q1.id <> q2.id),
  a1 as (select distinct id1, id2, set1 ds from p where m1 = 0),
  a2 as (select distinct id1, id2, set1 ds from p where m1 = 1),
  a3 as (select distinct id1, id2, set2 ds from p where m2 = 0)
select t1.id id1, t1.name name1, t2.id id2, t2.name name2, 
       (select listagg(ds, ' ; ') within group (order by ds) 
          from a1 where id1 = t1.id and id2 = t2.id) l1, 
       (select listagg(ds, ' ; ') within group (order by ds) 
          from a2 where id1 = t1.id and id2 = t2.id) l2, 
       (select listagg(ds, ' ; ') within group (order by ds) 
          from a3 where id1 = t1.id and id2 = t2.id) l3
  from t t1
  join t t2  on t1.id <> t2.id;

Result:

   ID1 NAME1    ID2 NAME2 L1            L2                L3
------ ----- ------ ----- ------------  ----------------  -------------
     1 ABC1       2 ABC2  E:F           A:B ; C:D         R:S
     1 ABC1       3 ABC3  C:D ; E:F     A:B               C:4 ; G:5
     1 ABC1       4 ABC4                A:B ; C:D ; E:F
     2 ABC2       1 ABC1  R:S           A:B ; C:D         E:F
     2 ABC2       3 ABC3  C:D ; R:S     A:B               C:4 ; G:5
     2 ABC2       4 ABC4  R:S           A:B ; C:D         E:F
     3 ABC3       1 ABC1  C:4 ; G:5     A:B               C:D ; E:F
     3 ABC3       2 ABC2  C:4 ; G:5     A:B               C:D ; R:S
     3 ABC3       4 ABC4  C:4 ; G:5     A:B               C:D ; E:F
     4 ABC4       1 ABC1                A:B ; C:D ; E:F
     4 ABC4       2 ABC2  E:F           A:B ; C:D         R:S
     4 ABC4       3 ABC3  C:D ; E:F     A:B               C:4 ; G:5
12 rows selected

Subbquery q uses one of the splitting techniques from SO to divide words into separate rows. Then I self-joined data and counted matching / non-matching words. Subqueries a1 - a3 are needed only because function listagg does not respect distinct clause.

This solution compares (1 and 4) and (4 and 1). You can change this to show results only once by replacing <> in t1.id <> t2.id and q1.id <> q2.id with <.

Ponder Stibbons
  • 14,723
  • 2
  • 21
  • 24
0

I think your sample data set isn't correct.

  • Shouldn't for ID = 2 D be separated from R by semi-colon (not comma)?
  • Shouldn't for ID = 3 5 be separated from A by semi-colon (not colon)?

I fixed that (if it should be fixed) and wrote the following PL/SQL code; I don't know whether you can do that in pure SQL. Have a look, see if it helps.

What does it do? Using nested loops, splits all DATASETs into rows and - using SET operators (INTERSECT, MINUS) decides which group the result belongs to (matched / only in base / only in target).

SQL> select * from test;

        ID NAME DATASET
---------- ---- --------------------
         1 ABC1 A:B;C:D;E:F
         2 ABC2 A:B;C:D;R:S
         3 ABC3 C:4;G:5;A:B
         4 ABC4 A:B;C:D;E:F

SQL> set serveroutput on
SQL>
SQL> DECLARE
  2     l_matched   VARCHAR2 (20);
  3     l_base      VARCHAR2 (20);
  4     l_target    VARCHAR2 (20);
  5  BEGIN
  6     FOR cur_1 IN (  SELECT id, name, dataset
  7                       FROM test
  8                   ORDER BY id)
  9     LOOP
 10        FOR cur_2 IN (  SELECT id, name, dataset
 11                          FROM test
 12                         WHERE id > cur_1.id
 13                      ORDER BY id)
 14        LOOP
 15           -- Matched
 16           SELECT LISTAGG (col, ';') WITHIN GROUP (ORDER BY col)
 17             INTO l_matched
 18             FROM (    SELECT REGEXP_SUBSTR (cur_1.dataset,
 19                                             '[^;]+',
 20                                             1,
 21                                             LEVEL)
 22                                 col
 23                         FROM DUAL
 24                   CONNECT BY LEVEL <= REGEXP_COUNT (cur_1.dataset, ';') + 1
 25                   INTERSECT
 26                       SELECT REGEXP_SUBSTR (cur_2.dataset,
 27                                             '[^;]+',
 28                                             1,
 29                                             LEVEL)
 30                                 col
 31                         FROM DUAL
 32                   CONNECT BY LEVEL <= REGEXP_COUNT (cur_2.dataset, ';') + 1);
 33
 34           -- Only in base
 35           SELECT LISTAGG (col, ';') WITHIN GROUP (ORDER BY col)
 36             INTO l_base
 37             FROM (    SELECT REGEXP_SUBSTR (cur_1.dataset,
 38                                             '[^;]+',
 39                                             1,
 40                                             LEVEL)
 41                                 col
 42                         FROM DUAL
 43                   CONNECT BY LEVEL <= REGEXP_COUNT (cur_1.dataset, ';') + 1
 44                   MINUS
 45                       SELECT REGEXP_SUBSTR (cur_2.dataset,
 46                                             '[^;]+',
 47                                             1,
 48                                             LEVEL)
 49                                 col
 50                         FROM DUAL
 51                   CONNECT BY LEVEL <= REGEXP_COUNT (cur_2.dataset, ';') + 1);
 52
 53           -- Only in target
 54           SELECT LISTAGG (col, ';') WITHIN GROUP (ORDER BY col)
 55             INTO l_target
 56             FROM (    SELECT REGEXP_SUBSTR (cur_2.dataset,
 57                                             '[^;]+',
 58                                             1,
 59                                             LEVEL)
 60                                 col
 61                         FROM DUAL
 62                   CONNECT BY LEVEL <= REGEXP_COUNT (cur_2.dataset, ';') + 1
 63                   MINUS
 64                       SELECT REGEXP_SUBSTR (cur_1.dataset,
 65                                             '[^;]+',
 66                                             1,
 67                                             LEVEL)
 68                                 col
 69                         FROM DUAL
 70                   CONNECT BY LEVEL <= REGEXP_COUNT (cur_1.dataset, ';') + 1);
 71
 72           DBMS_OUTPUT.put_line (
 73                 cur_1.id
 74              || ' '
 75              || cur_1.name
 76              || ' '
 77              || cur_2.id
 78              || ' '
 79              || cur_2.name
 80              || ' '
 81              || rpad(l_matched, 20, ' ')
 82              || ' '
 83              || rpad(l_base, 20, ' ')
 84              || ' '
 85              || rpad(l_target, 20, ' '));
 86        END LOOP;
 87     END LOOP;
 88  END;
 89  /
1 ABC1 2 ABC2 A:B;C:D              E:F                  R:S
1 ABC1 3 ABC3 A:B                  C:D;E:F              C:4;G:5
1 ABC1 4 ABC4 A:B;C:D;E:F
2 ABC2 3 ABC3 A:B                  C:D;R:S              C:4;G:5
2 ABC2 4 ABC4 A:B;C:D              R:S                  E:F
3 ABC3 4 ABC4 A:B                  C:4;G:5              C:D;E:F

PL/SQL procedure successfully completed.

SQL>
Littlefoot
  • 131,892
  • 15
  • 35
  • 57
  • I fixed the sample data delimiters thanks for pointing out - reading through your answer now! – KP65 Jan 31 '19 at 10:12
  • This worked perfectly. I was wondering if there is a faster way to implement the regex as it doesn't seem to scale well... – KP65 Jan 31 '19 at 14:24
  • Well, for such an approach (sets, I mean), you have to split that string to rows. I guess that any other approach might be even slower. If possible, modify your data model so that it is normalized (as current one isn't - you shouldn't store different info in the same column). – Littlefoot Jan 31 '19 at 19:01
  • So i should create a procedure to split the data first...and then try the above....sorry i am no expert with pl/sql and i'm not a developer either.. – KP65 Feb 01 '19 at 09:11
  • Huh, right ... it would be easier if you were a developer :) When I said that "data model" isn't right, it means that you should create some more tables, drop (or recreate) current ones. I guess you won't be doing that. For one time only, yes - you might write some code to split data and store the result into some other table, but - if you're going to use such a code more than once, that won't be a good solution either. Unfortunately, you're paying for someone else's mistakes. – Littlefoot Feb 01 '19 at 09:16
  • Thanks! I will open another question as i have the data in a more normalised format...but struggling to implement the above with that new structure – KP65 Feb 01 '19 at 09:22
  • cheeky link to the new question...https://stackoverflow.com/questions/54476625/joining-2-data-sets-in-oracle-via-intersection-in-oracle-pl-sql – KP65 Feb 01 '19 at 10:06