Oracle:Comparing comma separated values from two columns of two different tables

Question

I want to compare the values of two columns (diff table) having comma separated values of two different Oracle tables. I want to find rows that match with all values (NAME1 all values should match with NAME2 values).

Note: The comma separated values are in different order.

Example:

T1:
ID_T1             NAME1
===================================


1      ASCORBIC ACID, PARACETAMOL, POTASSIUM HYDROGEN CARBONATE
2      SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS, CITRIC ACID
3      CAFFEINE, PARACETAMOL PH. EUR.
4      PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE
5      PARACETAMOL, DEXTROMETHORPHAN, PSEUDOEPHEDRINE, PYRILAMINE

T2:

ID_T2          NAME2
=================================

 4      POTASSIUM HYDROGEN CARBONATE, ASCORBIC ACID, PARACETAMOL
 5      SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS
 6      PARACETAMOL PH. EUR.,CAFFEINE
 7      CODEINE PHOSPHATE, PARACETAMOL DC
 8      DEXCHLORPHENIRAMINE MALEATE, DEXTROMETHORPHAN HYDROBROMIDE 
10      DEXCHLORPHENIRAMINE MALEATE, PSEUDOEPHEDRINE HYDROCHLORIDE
11      PARACETAMOL, DEXTROMETHORPHAN, PSEUDOEPHEDRINE, PYRILAMINE1

MY RESULT should only show the matching row based on ALL NAME Matches in both tables.

ID_T1    ID_T2    MATCHING NAME
    ==================================
    1            4    POTASSIUM HYDROGEN CARBONATE, ASCORBIC ACID, PARACETAMOL
    3            6    PARACETAMOL PH. EUR.,CAFFEINE
    4           10    PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE

There is PARTIAL solution provided by existing member @Goran below, Below solution work for all values except last row. Below solution is finding a match for 5TH ROW of T1 Vs 11TH ROW of T2 Which is Wrong as the last row value of T2 is "PYRILAMINE1" which is <> to T1 Last row value "PYRILAMINE"

PARTIAL SOLUTION:

SELECT
    T1.ID_T1,
    T2.ID_T2,
    T1.NAME1
FROM
    T1
    JOIN T2 ON TRIM('#' FROM TRANSLATE(T1.NAME1, T2.NAME2, '#')) IS NULL
               AND TRANSLATE(T1.NAME1, T2.NAME2, '#') IS NOT NULL
               AND REGEXP_COUNT(T1.NAME1, ',') = REGEXP_COUNT(T2.NAME2, ',');

The so-called "partial solution" is 100% wrong. It will match, for example, the string `'CALCIUM CARBONATE, SODIUM NITRATE'` and the string `'CALCIUM NITRATE, SODIUM CARBONATE'`. Where did you find this so-called "solution"? — , Oct 06 '19 at 12:57
Question for clarification. I see that in most cases there is a space after each comma, but there are a few exceptions. Is that a typo on your part, or do you in fact observe that in the data? And if you do see such exceptions in the data, is it also possible that the "names" are not all-caps? Is it possible that there are typos in the names? All these anomalies may cause "incorrect" results in the query. Which shows just one of the many fatal weaknesses in the data model (which violates First Normal Form, the most basic requirement for a good data model). — , Oct 06 '19 at 13:05
@mathguy: 1) Yes, space is there in most of the values. 2) Most of names are in CAPS 3) This is what data model we have, it is not created by me. — Pramod Kumar, Oct 06 '19 at 13:58
Please do not post duplicate questions. If you don't like the response you got to [your previous question](https://stackoverflow.com/q/58237799/146325) please comment to the answers there and/or **edit your question** to explain what more you need. — APC, Oct 06 '19 at 14:23

score 0 · Answer 1 · answered Oct 06 '19 at 13:53

Here is one way, purely in SQL and not taking advantage of any features from Oracle 12 and higher (since you didn't tell us your version, I didn't want to make assumptions). If you find yourself needing this kind of comparison more than once, you would be better off writing a "normalize_string" function along the lines shown in the code below, and using it whenever needed. What it would do (and what the code below does) is to split the comma-separated string into tokens, trim spaces from the front and the end of each token, convert each token to upper-case (allowing for inputs with lower-case letters), de-duplicate tokens (allowing for the same token to appear more than once in the input), and then re-create the comma-separated list, with the tokens appearing in alphabetical order. This, then, allows comparison of the resulting strings to each other.

Even better, if it's within your powers (of either doing it yourself or influencing the bosses to do it), is to change the data model. You should have a separate, small table for individual ingredients, with a primary key (perhaps a number, certainly not the NAME of the ingredient) for reference from other tables. Then you need a separate, small table for drugs, also with a primary key (not the drug name!) And finally a many-to-many relationship table, with a row for each ingredient in each drug - both identified by their respective keys (not by names). This will avoid typos in drug or ingredient names, lower-case instead of upper-case, etc. and it would make all your code much easier to write, test and maintain, and a lot faster. Can you do that though?

= = = = = = =

In the output below I show the "name" as it appears in t1 (in your desired output you show the name as it appears in t2 but I assumed you don't care). Perhaps it would be better to show the normalized version instead (norm1 instead of name1).

select n1.id_t1, n2.id_t2, n1.name1
from
  ( select id_t1, name1, listagg(token, ', ') within group (order by token) norm1
    from   ( select  distinct id_t1, name1,
                     upper(trim(substr(str, instr(str, ',', 1, level) + 1,
                           instr(str, ',', 1, level + 1) 
                              - instr(str, ',', 1, level) - 1))) token
             from    (select id_t1, name1, ',' || name1 || ',' as str from t1)
             connect by  level <= length(str) - length(replace(str, ',')) - 1 
                     and prior id_t1 = id_t1
                     and prior sys_guid() is not null
           )
    group  by id_t1, name1
  ) n1
  inner join
  ( select id_t2, name2, listagg(token, ', ') within group (order by token) norm2
    from   ( select  distinct id_t2, name2, 
                     upper(trim(substr(str, instr(str, ',', 1, level) + 1,
                           instr(str, ',', 1, level + 1)
                              - instr(str, ',', 1, level) - 1))) token
             from    (select id_t2, name2, ',' || name2 || ',' as str from t2)
             connect by  level <= length(str) - length(replace(str, ',')) - 1 
                     and prior id_t2 = id_t2
                     and prior sys_guid() is not null
           )
    group  by id_t2, name2
  ) n2
  on n1.norm1 = n2. norm2
;

Output

 ID_T1  ID_T2 NAME1                                                             
------ ------ ------------------------------------------------------------------
     1      4 ASCORBIC ACID, PARACETAMOL, POTASSIUM HYDROGEN CARBONATE          
     3      6 CAFFEINE, PARACETAMOL PH. EUR.                                    
     4     10 PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE

Thanks a lot for your work, MUCH APPRECAITED. When i run this query with thousand of records then it is taking long. Can we use REGEXP also for better performance? These are list of ingredients only, not drug. A drug is composed of multiple ingredients. So i can have a table where for one DRUG_ID i have multiple INGREDIENT_ID. Is this what you are suggesting ? — Pramod Kumar, Oct 07 '19 at 02:08
@PramodKumar - I imagine each ID in each table is the ID of a drug, no? Yes, what I am suggesting is a many-to-many table, where one DRUG_ID may appear on three or five different rows, once for every ingredient it has (identified by INGREDIENT_ID). As for speed: REGEXP will make things slower, not faster. Standard string functions are often **much** faster than regular expressions. To make the query faster you can take advantage of features from Oracle 12 or higher, for example the LATERAL clause. So, what is your Oracle version? Run `SELECT * FROM V$VERSION` to find out. — , Oct 07 '19 at 02:52
Yes, they are ID of the DRUG. I got your point, yes the plan is to create such table. My Version is: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0. i am still working to run your query for thousands of records. I will let you know one once i make the table and run the query on that table. GREAT JOB MAN, YOU ARE DOING EXCELLENT JOB FOR YOUR COMMUNITY. — Pramod Kumar, Oct 07 '19 at 14:01

Oracle:Comparing comma separated values from two columns of two different tables

PARTIAL SOLUTION:

1 Answers1