1

I have a solution in Java for a simply defined problem, but I want to improve the time needed to execute the data handling. The problem is to take a series of words held in a column on a relational database and split the words into pairs which are then insert into a dictionary of pairs. The pairs themselves relate to a product identified by partid.

Hence the Part table has

PartID (int), PartDesc (nvarchar) 

and the dictionary has

DictID (int), WordPair (nvarchar).

The logic is therefore:

insert into DictPair (wordpair, partid)
select wordpairs, partid from Part

A wordpair is defined as two adjacent words and hence words will be repeated, eg

red car with 4 wheel drive 

will pair to

{red, car},{car, with}, {with,4}, {4, wheel}, {wheel, drive}

Hence the final dictionary for say partid 45 will have (partid, dictionarypair):

45, red car
45, car with
45, with 4
45, 4 wheel
45, wheel drive

This is used in product classification and hence word order matters (but pair order does not matter).

Has anyone any ideas on how to solve this? I was thinking in terms of stored procedures, and using some kind of parsing. I want the entire solution to be implemented in SQL for efficiency reasons.

Trevor Oakley
  • 438
  • 5
  • 17

1 Answers1

3

Basically, find a split() function on the web that returns the position of a word in a string.

Then do:

select s.word, lead(s.word) over (partition by p.partId order by s.pos) as nextword
from parts p outer apply
     dbo.split(p.partDesc, ' ') as s(word, pos);

This will put NULL for the last pair, which you don't seem to want. So:

insert into DictPair (wordpair, partid)
    select word + ' ' nextword, partid, 
    from (select p.*, s.word, lead(s.word) over (partition by p.partId order by s.pos) as nextword
          from parts p outer apply
               dbo.split(p.partDesc, ' ') as s(word, pos)
         )
    where nextword is not null;

Here are some split functions, provided by Googling "SQL Server split". And another. And from StackOverflow. And there are many more

Community
  • 1
  • 1
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786