1

I have in a table called institutions containing the names of institutions. Each time a new name come in, a function needs to find any matches with respect to the existing institutions. For instance, we have the following institutions table:

NAME
FORMULA VIEW UNIVERSITY RESEARCH
FOURMULA VULCAN COLLEGE INSTITUTE
MOUNT VU FOOD GROWERS
FORMULA VU CAFE SHOP

The following synonyms table:

WORD            SYN_LIST
EDUCATION       SCHOOL, UNIVERSITY, COLLEGE, TRAINING
RESTAURANT      BAR, BISTRO, BREAKFAST, BUFFET, CABARET, CAFE, FOOD, GALLEY, GASTROPUB, GRILL
STORE           CONVENIENCE, FOOD, GROCER, GROCERY, MARKET, MART, SHOP, STORE, VARIETY
REFRIGERATION   APPLIANCE, COLDSTORAGE, FOODLOCKER, FREEZE, FRIDGE, ICE, REFRIGERATION

Finally, the word substitution table:

WORD        SUBS_LIST
MOUNTAIN    MOUNTAIN, MOUNT, MT, MTN
VIEW        VIEW, VU
FORMULA     FORMULA, 4MULA, FOURMULA

In top of that, the name from institutions is classified as name|field. To accomplish the classification of a word as name|field is just needed to pass manually to the function using two input parameters. For instance, the name FORMULA VIEW UNIVERSITY RESEARCH is decomposed as FORMULA VIEW as name and UNIVERSITY RESEARCH as field. When searching for a name in institutions, the name component (in this case FORMULA VIEW has to find an exact match) and at least one of the field words (in this case UNIVERSITY or/and RESEARCH).

The previous match rule would look like as follows:

select
from institutions
where name like '%FORMULA VIEW%' || ' %UNIVERSITY% '
UNION  
select
from institutions
where name like '%FORMULA VIEW%' || ' %RESEARCH% '

Then, if a new name comes in such as FORMULA VIEW COLLEGE would be enough to find match with FORMULA VIEW UNIVERSITY RESEARCH.

Finally, the synonym and word substitutions apply to the received name and each word can be in more than one category in the synonym table (as the word food is included in store and restaurant), then a varray could be needed in the function to find matches. Considering the previous three tables consider a new name comes such as MOUNTAIN VIEW STORE OFFICE, then the function finds MOUNT VU FOOD GROWERSalready in institutions table as a match. This match occurs because food has two synonyms, then two words are generated: MOUNTAIN VIEW FOOD GROWERS and MOUNTAIN VIEW STORE GROWERS.

select
from institutions
where name like '%MOUNTAIN VIEW%' || ' %STORE% '
UNION
select
from institutions
where name like '%MOUNTAIN VIEW%' || ' %OFFICE% '

Even though just STORE is a match and not OFFICE, this is enough to identify as a match and indicate a conflict in STORE.

I have been working on this problem for a while and created this DBFiddle to include most of the part of the previous explanation and attempting to solve the problem, but I have not been able to figure out how add the logic of searching for exact match in name category and pass one field category until passing the remaining field categories like in the example:

select
from institutions
where name like '%MOUNTAIN VIEW%' || ' %STORE% '
UNION
select
from institutions
where name like '%MOUNTAIN VIEW%' || ' %OFFICE% '

Please let me know if you find another solution more suitable to approach this problem. I hope I was clear enough, please let me know any questions.

John Barton
  • 1,581
  • 4
  • 25
  • 51

2 Answers2

1

Don't store data as a delimited string; use separate rows for each item:

CREATE TABLE synonyms (
  id          NUMBER(20,0)
              GENERATED ALWAYS AS IDENTITY
              PRIMARY KEY,
  word        VARCHAR(50)
              NOT NULL,
  replacement VARCHAR(50)
              NOT NULL,
  UNIQUE ( replacement )
);

You can write a PL/SQL function to split a string.

Then if you have a collection of words you can compare then in a query:

create or replace function replace_names(
  i_sentence IN VARCHAR2
) return string_list
IS
  p_words    string_list := split_string( i_sentence, ' ' );
  p_replaced string_list;
BEGIN
  SELECT COALESCE( s.word, w.word )
  BULK COLLECT INTO p_replaced
  FROM   synonyms s
         RIGHT OUTER JOIN (
           SELECT ROWNUM AS rn,
                  COLUMN_VALUE AS word
           FROM   TABLE( p_words )
         ) w
         ON ( s.replacement = w.word )
  ORDER BY w.rn;

  RETURN p_replaced;
END;
/

Your tests then become:

select * from TABLE( replace_names('MOUNT VU FOOD GROWERS' ) );

outputs:

| COLUMN_VALUE |
| :----------- |
| MOUNTAIN     |
| VIEW         |
| FOOD         |
| GROWERS      |

and

SELECT name,
       column_value corrected_name
FROM   institutions i
       CROSS JOIN TABLE(replace_names(i.name));

outputs:

NAME                             | CORRECTED_NAME
:------------------------------- | :-------------
FORMULA VIEW UNIVERSITY          | FORMULA       
FORMULA VIEW UNIVERSITY          | VIEW          
FORMULA VIEW UNIVERSITY          | EDUCATION     
FOURMULA VULCAN COLLEGE          | FORMULA       
FOURMULA VULCAN COLLEGE          | VULCAN        
FOURMULA VULCAN COLLEGE          | EDUCATION     
MOUNTAIN VIEW RESTAURANT GROWERS | MOUNTAIN      
MOUNTAIN VIEW RESTAURANT GROWERS | VIEW          
MOUNTAIN VIEW RESTAURANT GROWERS | FOOD          
MOUNTAIN VIEW RESTAURANT GROWERS | GROWERS       
FORMULA VU CAFE                  | FORMULA       
FORMULA VU CAFE                  | VIEW          
FORMULA VU CAFE                  | FOOD          

and:

SELECT name,
       column_value AS corrected_name
FROM   institutions i
       CROSS JOIN TABLE( replace_names( i.name ) )
WHERE  column_value MEMBER OF replace_names('MOUNTAIN VU FOOD'); 

outputs:

NAME                             | CORRECTED_NAME
:------------------------------- | :-------------
FORMULA VIEW UNIVERSITY          | VIEW          
MOUNTAIN VIEW RESTAURANT GROWERS | MOUNTAIN      
MOUNTAIN VIEW RESTAURANT GROWERS | VIEW          
MOUNTAIN VIEW RESTAURANT GROWERS | FOOD          
FORMULA VU CAFE                  | VIEW          
FORMULA VU CAFE                  | FOOD          

and:

SELECT name,
       ( SELECT LISTAGG( column_value, ' ' ) WITHIN GROUP ( ORDER BY ROWNUM )
         FROM   TABLE( corrected_words )
       ) AS corrected_name,
       num_matches
FROM   (
  SELECT name,
         replace_names( name ) AS corrected_words,
         CARDINALITY( replace_names( name ) MULTISET INTERSECT replace_names('MOUNTAIN VU FOOD') ) AS num_matches
  FROM   institutions
)
WHERE  num_matches > 0
ORDER BY num_matches DESC; 

outputs:

NAME                             | CORRECTED_NAME             | NUM_MATCHES
:------------------------------- | :------------------------- | ----------:
MOUNTAIN VIEW RESTAURANT GROWERS | MOUNTAIN VIEW FOOD GROWERS |           3
FORMULA VU CAFE                  | FORMULA VIEW FOOD          |           2
FORMULA VIEW UNIVERSITY          | FORMULA VIEW EDUCATION     |           1

db<>fiddle here

MT0
  • 143,790
  • 11
  • 59
  • 117
  • Thanks @MT0 for your help, I have had a change in the problem that now requires me to use Postgresql, I have tried to replicate your solution in Postgresql, but I am not able to run successfully the last two queries of your solution. This is the db<>fiddle I have tried converting to Postgresql: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=c44789f9f8535a9ad90819680d48ece7 – John Barton Nov 29 '19 at 01:14
  • @JuanPerez [Ask a new question](https://stackoverflow.com/questions/ask) rather than trying to change the scope of an existing one in comments. – MT0 Nov 29 '19 at 07:46
0

Your problem is your data model.

  1. If the first two words are the name and the second two words the field, why then do you store these in one column and not two?
  2. In a relational database you shouldn't store comma-separated strings. (At least as long as you want to work with the separate parts.)

Then, there is no big difference between synonyms and substitutions, for both boil down to "different words mean about the same thing".

Let's simplify the data model. (You don't have to do this to the following extent and stay with separate synonym and substitution tables, abeit altered, and even a concatenated institution string, if you want to):

institution

name            | field 
----------------+--------------------
FORMULA VIEW    | UNIVERSITY RESEARCH
FOURMULA VULCAN | COLLEGE INSTITUTE
MOUNT VU        | FOOD GROWERS
FORMULA VU      | CAFE SHOP
word_grp

grp             | word
----------------+--------------------
EDUCATION       | SCHOOL
EDUCATION       | UNIVERSITY
EDUCATION       | COLLEGE
EDUCATION       | TRAINING
...             | ..
MOUNTAIN        | MOUNTAIN
MOUNTAIN        | MOUNT
MOUNTAIN        | MT
MOUNTAIN        | MTN
...             | ..

For string replacement you can write a function that normalizes the string by replacing the words with their group words:

create or replace function normstring(in_str varchar2) return varchar2 is
  i integer := 1;
  v_str varchar2(4000) := in_str;
  v_grp varchar2(100);
  v_word varchar2(100);
begin
  while regexp_substr(v_str, '\w+', 1, i) is not null loop
    v_word := regexp_substr(v_str, '\w+', 1, i);
    select max(grp) into v_grp from word_grp where word = v_word;
    if v_grp is not null then
      v_str := replace(v_str, v_word, v_grp);
    end if;
    i := i + 1;
  end loop;
  return v_str;
end normstring;

Then for a new name and field you can look up the table using above function. If there is a row where both name and field match, this comes first. Then come all rows where only the name matches. If there isn't even a match on the name, then no row is returned. The query is very simpel:

select *
from institution i 
where normstring(i.name) = normstring(:name)
order by
  d.name, d.field, case when normstring(i.field) = normstring(:field) then 1 else 2 end;

Demo: https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=9d17aa5fdfd7bb04d78f4087792b3bce

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73