1

I have a generic requirement to group data and get the common string out of it. I am not sure which technology will work for this specific requirement . Hence mentioned it as generic. I currently have access to SQL,Unix commands, SAS , MSTR, Informatica, and Python.

Actual Data:

Field A| Field B | Field c
A      | 001     | Alan J
A      | 001     | Alan Walker 
A      | 001     | Walker Alan 
A      | 001     | Alexander AlanJoseph 

Required Output:

Field A | Field B | Field C 
A       | 001     | Alan

Note: Based on grouping of Field A and Field B, need to find out the common pattern among the data in Field C

Could someone please help me out on this ?

Scott77
  • 21
  • 3
  • This is not a suitable problem for SQL. If you are looking for the longest common substring, then there might be appropriate libraries in an analytic tool/programming language such as SAS or Python. – Gordon Linoff Jun 18 '20 at 10:41
  • This post may give you some direction- https://stackoverflow.com/questions/49394121/common-substring-of-two-string-in-sql – mkRabbani Jun 18 '20 at 10:52

2 Answers2

1

Welcome to SAS. Your question can be done in many ways, I give you the DOW loop version.

data have;
    input A$ 1-3 B$ 12-15 C$ 21-43;
    cards;
    A      | 001     | Alan 
    A      | 001     | Alan Walker 
    A      | 001     | Walker Alan 
    A      | 001     | Alexander AlanJoseph
    B      | 002     | Jay 
    B      | 002     | Jay Zhou
    B      | 002     | JayJay 
;
run;

proc sort;
    by A B;
run;

data want(keep=A B C);
    length string $1024.;

    do until(last.B);
        set have;
        by A B;
        string = catx('@',string,C);
        count = sum(count,1);
    end;

    do until(last.B);
        set have;
        by A B;
        do i = 1 to count;
            if find(scan(string,i,'@'),cats(C)) then match = sum(match,1);
        end;
        if count = match then output;
    end;
run;

proc print;
run;

whymath
  • 1,262
  • 9
  • 16
  • Thanks for the response. I have liked your approach. I have updated my input data set for the first record, in that case what modification is needed in the SAS code to fetch the result? The output has to display the common word for the group of field A and field B. Thanks – Scott77 Jun 22 '20 at 03:00
  • @Scott77 This is very diffcult to me, perhaps I can not help. Maybe you can google it for "longest common substring". I had did it but the anwser were all about 2 strings, not for n(n>2). I will take a good think the several days. – whymath Jun 24 '20 at 12:12
1

Below is the solution,

Sql

select field_a, field_b, field_c from
(select distinct d1.*, row_number() over(partition by d1.field_a order by d1.field_c) 
as rnk from data d1
inner join data d2 on 
(instr(d1.field_c, d2.field_c) > 0)) where rnk=1;

https://dbfiddle.uk/?rdbms=oracle_18&fiddle=ccd9837b8e893ed42f8ade3b7e7e9d40

Informatica

Use Expression transformation, find the patter using INSTR function (you can also use regular expressions). Then the pass the ports to rank transformation and sort it be field_c and fetch top 1 values.

Jim Macaulay
  • 4,709
  • 4
  • 28
  • 53
  • To understand the SQL : search each string in each other `((instr(d1.field_c, d2.field_c) > 0)` and then keep the one you find most times in other values `where rnk=1` – Mickaël Bucas Jun 18 '20 at 17:13