0

There is a simplified table mytable with columns ('id', 'mycolumn') of int and varchar(255) respectively.

What's the fastest way to find a row with the longest common right part (the longest tail) for current string in the mycolumn? The table is huge.

As example, for the string "aaabbbkr84" we'd find 2 - "wergerbkr84" by the longest found tail "bkr84":

1 - "gbkyugbk9"
2 -  "wergerbkr84"
3 - "gbkyugbk4".

To increase the performance, my thoughts were to create another column with the reverted strings. However I'm not sure if it helps and there is no better way. That way I would look up (until I get 0 results to return to the previous one) to look up for

SELECT id, mycolumn FROM mytable
where reverted like '4%';
SELECT id, mycolumn FROM mytable
where reverted like '48%';
SELECT id, mycolumn FROM mytable
where reverted like '48r%';
SELECT id, mycolumn FROM mytable
where reverted like '48rk%';
SELECT id, mycolumn FROM mytable
where reverted like '48rkb%'; ' <-- looking for this one
SELECT id, mycolumn FROM mytable
where reverted like '48rkbb%'; ' 0 results.

0 results found, taking a step back:48rkb. Which means 2 - "wergerbkr84" is the answer.

1 query for to find a row by the longest tail would be preferable (I have a loop of queries above as you see). However, the performance is #1.

Thanks a lot.

Haradzieniec
  • 9,086
  • 31
  • 117
  • 212

1 Answers1

1

This may be a bit on the exotic side but it sounds like you have an exotic problem.

Have you tried storing the proper tree structure in SQL and doing lookups on it? For example, you could create a trie/prefix tree where there is an edge for each possible character choice among all strings in the database (going in reverse string order).

trie.png

See the above image for an example of a partially filled-in trie on the three strings in your question (labeled 1, 2, 3). Nodes store references to all strings that are encoded from the edge labels as you traverse from the root. For the search string "48", follow edges labeled '4' and '8' and you will get a reference to the string labeled 2 ("wergerbkr84"). A search proceeds until no edges are labeled with the next character in the search string, at which point the strings with longest matching tail are stored at the current node. Search time is O(n) where n is the length of the search string.

See the following: How to represent a data tree in SQL? Note you will have to consider storage for the edge labels, which is not required in most tree structures.

Also note that there are compressed/space-optimized versions of tries that may be worth investigating depending on your storage requirements.

Community
  • 1
  • 1
Matt Davis
  • 361
  • 1
  • 4
  • 10