I have a solution in Java for a simply defined problem, but I want to improve the time needed to execute the data handling. The problem is to take a series of words held in a column on a relational database and split the words into pairs which are then insert into a dictionary of pairs. The pairs themselves relate to a product identified by partid.
Hence the Part table has
PartID (int), PartDesc (nvarchar)
and the dictionary has
DictID (int), WordPair (nvarchar).
The logic is therefore:
insert into DictPair (wordpair, partid)
select wordpairs, partid from Part
A wordpair is defined as two adjacent words and hence words will be repeated, eg
red car with 4 wheel drive
will pair to
{red, car},{car, with}, {with,4}, {4, wheel}, {wheel, drive}
Hence the final dictionary for say partid 45 will have (partid, dictionarypair):
45, red car
45, car with
45, with 4
45, 4 wheel
45, wheel drive
This is used in product classification and hence word order matters (but pair order does not matter).
Has anyone any ideas on how to solve this? I was thinking in terms of stored procedures, and using some kind of parsing. I want the entire solution to be implemented in SQL for efficiency reasons.