
I have a column with the following data:

Brand
-------------
Audi, Opel, Ford
Skoda, Renault
Audi, BMW
Audi, Volkswagen, Opel
Toyota, Hyundai

I would like a query that automatically assigns the data into groups as follows:

Brand
-------------------
Audi, Opel, Ford, BMW, Volkswagen
Skoda, Renault
Toyota, Hyundai

Note that if we insert another record into the table like this ...

Toyota, BMW

... the required output would be:

Brand
-------------------
Audi, Opel, Ford, BMW, Volkwagen, Toyota, Hyundai
Skoda, Renault

2 Answers


This is an interesting and difficult problem, obscured by your poor data model (which violates First Normal Form). Normalizing the data (and de-normalizing at the end) is trivial; it's just an annoyance, and it will make the query much slower. The interesting part: the input groups are the nodes of a graph, and two nodes are connected if they have a "make" in common. You need to find the connected components of that graph; this is the interesting problem.
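Outside of SQL, the grouping logic described above can be sketched with a standard union-find (disjoint-set) structure. This is a hypothetical Python illustration of the algorithm, not part of the Oracle solution; the function name and structure are my own:

```python
# Sketch of the connected-components grouping: each comma-separated row
# is a node, and two rows are connected when they share a brand.
# Union-find merges every brand in a row into one component.

def group_brands(rows):
    parent = {}

    def find(x):
        # Follow parent pointers to the component root, compressing paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for row in rows:
        brands = [b.strip() for b in row.split(",")]
        for b in brands:
            parent.setdefault(b, b)
        # Merge all brands in this row into the first brand's component.
        for b in brands[1:]:
            union(brands[0], b)

    groups = {}
    for b in parent:
        groups.setdefault(find(b), set()).add(b)
    return [sorted(g) for g in groups.values()]

rows = ["Audi, Opel, Ford", "Skoda, Renault", "Audi, BMW",
        "Audi, Volkswagen, Opel", "Toyota, Hyundai"]
print(group_brands(rows))
```

On the sample data this yields three components, matching the expected output: one containing Audi, BMW, Ford, Opel and Volkswagen; one with Renault and Skoda; and one with Hyundai and Toyota.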

Here is a complete solution (creating the testing data on the fly, in the first factored subquery in the with clause). Question for you though: even assuming that this solution works for you and you put it in production, who is going to maintain it in the future?

EDIT It occurred to me that my original query can be simplified. Here is the revised version; you can click on the Edited link below the answer if you are curious to see the original version.

with
  sample_data (brand) as (
    select 'Audi, Opel, Ford'       from dual union all
    select 'Skoda, Renault'         from dual union all
    select 'Audi, BMW'              from dual union all
    select 'Audi, Volkswagen, Opel' from dual union all
    select 'Toyota, Hyundai'        from dual union all
    select 'Tesla'                  from dual
  )
, prep (id, brand) as (
    select rownum, brand 
    from   sample_data
  )
, fnf (id, brand) as (
    select p.id, ca.brand
    from   prep p  cross apply
           ( select  trim(regexp_substr(p.brand, '[^,]+', 1, level)) as brand
             from    dual
             connect by level <= regexp_count(p.brand, '[^,]+')
           ) ca
  )
, g (b1, b2) as (
    select distinct fnf1.brand, fnf2.brand
    from   fnf fnf1 join fnf fnf2 on fnf1.id = fnf2.id
  )
, cc (rt, brand) as (
    select  min(connect_by_root b1), b2
    from    g
    connect by nocycle b1 = prior b2
    group   by b2
  )
select listagg(brand, ', ') within group (order by null) as brand
from   cc
group  by rt;

Output:

BRAND                                        
---------------------------------------------
Audi, BMW, Ford, Opel, Volkswagen
Hyundai, Toyota
Renault, Skoda
Tesla
  • I just don't understand the result that the OP wants. Congrats on having understood it and coming up with that solution. You really are a math guy ;) – Thomas G Aug 16 '20 at 21:29
  • @ThomasG - The OP wants to consolidate his lists (which appear as comma-separated strings in his input): if two such "lists" have at least one "brand" in common, consolidate the two lists into a single one. Continue until the remaining lists are pairwise disjoint. –  Aug 16 '20 at 21:36

That is the standard connected-components problem. You can find a fast PL/SQL solution for production use here: http://orasql.org/2017/09/29/connected-components/

Or, for purely educational purposes, you can use a SQL-only solution: https://gist.github.com/xtender/b6e5cac4dec461c0121145b0e62c5cf5

with t(Brand) as (
    select 'Audi, Opel, Ford' brand from dual union all
    select 'Skoda, Renault'         from dual union all
    select 'Audi, BMW'              from dual union all
    select 'Audi, Volkswagen, Opel' from dual union all
    select 'Toyota, Hyundai'        from dual union all
    select 'Tesla'                  from dual union all
    select 'A'||level||', A'||(level+1)   from dual connect by level<=500 union all
    select 'B'||level||', B'||(level+1)   from dual connect by level<=500 union all
    select 'C'||level||', C'||(level+1)   from dual connect by level<=500
)
,split_tab as (
   select
      dense_rank()over(order by t.brand) rn
     ,x.*
   from t,
       xmltable(
         'ora:tokenize(concat(",",.),",")[position()>1]'
         passing t.brand
         columns 
             n for ordinality
            ,name varchar2(20) path 'normalize-space(.)' 
       ) x
)
,pairs as (
   select 
      t1.rn, t1.name name1, t2.name name2
   from split_tab t1
       ,split_tab t2
   where t1.rn=t2.rn
)
select listagg(x,',')within group(order by x) 
from (
   select x, min(root) grp
    from (
    select distinct connect_by_root(name1) root, name1 x
    from pairs
    connect by nocycle 
         prior name1 = name2
    )
   group by x
)
group by grp
/

PS. I've split my solution into the smallest possible steps, so you can check each CTE separately, step by step, to see how the results are built.

Sayan Malakshinov
  • I haven't tested your PL/SQL solution. The SQL solution must be written with more care; for example, it will crash if one of the input strings only contains one brand (no commas). –  Aug 16 '20 at 21:58
  • Just to see what your query does *other than* splitting the input lists with `ora:tokenize` (which causes the crash I mentioned in my first comment), I replaced the `split_tab` CTE with the (inefficient) one from my answer, using `regexp`. When I add a "list" consisting of a single brand, the splitting works correctly; but the rest of your query does not include that row in the output, it simply ignores it altogether. –  Aug 16 '20 at 23:45
  • @mathguy 1) As a math guy you should know about "boundary conditions". They were not stated by the OP, so I wonder why you ask about them. 2) As a math guy, you should know that the pure SQL solution has exponential time complexity (even higher than O(n!)), so what you are suggesting for production use is not really suitable for production. 3) Be polite: it's rude to use "must" here. – Sayan Malakshinov Aug 16 '20 at 23:57
  • In general, when the requester doesn't explicitly mention special cases, it is part of "our" job to ask for clarification. Besides, in this case strings with only one token are not the "boundary" cases; the boundary case is the null input (or a string with only commas, all tokens are null). Then: I doubt that in a problem having to do with car brands execution time will be meaningful. Besides, the execution time will depend on the expected size of connected components, among other things. But in any case, I gave no opinion on which solution will be faster; my point was only about correctness. –  Aug 17 '20 at 00:03
  • I took a closer look at your "connect by" solution to finding connected components. It is massively inefficient (by several orders of magnitude), because it takes the wrong approach. Take a look at the solution in my Answer, and let me know if you are interested in discussing the topic further; we can find the right forum. I believe I first wrote up the approach here: https://community.oracle.com/message/14968482#14968482 (although I am sure I only reinvented the wheel). A variation of the same idea is here: https://community.oracle.com/message/15441405#15441405 I don't recall which is better. –  Aug 17 '20 at 02:06
  • Or it seems this one was earlier... https://stackoverflow.com/questions/52629059/flag-individuals-that-share-common-features-with-oracle-sql –  Aug 17 '20 at 02:11
  • I wrote a lot of different solutions for this problem many years ago. You can find my more efficient solutions on pure SQL and much more effective on plsql in the book "Power of Oracle sql" by Alex Reprintzev: https://sqlmdx.net/2011/03/15/tpoos/ and recheck updated version here. Also check pl/sql solutions. You can email me to sayan@orasql.org if you want to discuss this topic. – Sayan Malakshinov Aug 17 '20 at 02:22
  • And again: as I wrote here http://orasql.org/2014/02/28/straight-sql-vs-sql-and-plsql/ pl/sql – Sayan Malakshinov Aug 17 '20 at 02:25
  • There are many cases where PL/SQL combined with SQL can be more efficient than SQL alone, and the connected-components problem is one of them. – Sayan Malakshinov Aug 17 '20 at 02:26