Finding similar strings in tsql

Question

I am currently working on a logging system in sql server 2014, and need to find a way to assign similar strings into groups. Each string contains a prefix, an ID or Guid and then a suffix. I'd like to find a way find if there already exists strings with the same pre- and suffix in my table.

A typical string might look like:

'ApsisQueue.Synchronize @ApiKey = AA24FA75-CB78-451E-A11F-39115ABF8AA1, @CallbackId = 1'

Or

'ApsisQueue.Synchronize @ApiKey = AA24FA75-CB78-451E-A11F-39115ABF8AA1, LogMessage = NULL'

Or

'ApsisQueue.Synchronize @ApiKey = BC2CFC45-CB78-411E-A66F-3CC654AB125, LogMessage = NULL'

Or

'File id19957 does not exist'

Or

'File id1955 does not exist'

I'd like to group the messages depending on what appears before and after the Id/Guid values, so the messages above would be grouped into 3 groups.

I've considered using Levenshtein to find similar strings, but it seems a bit overkill since I know all differences will follow one another (it also causes performance issues due to the amount of strings that needs to be compared). The problem is that I don't know where in string the Guid/Id starts or ends.

Can you please add some sample data and explain along with it — TheGameiswar, Jul 20 '16 at 14:07
"The problem is that I don't know where in string the Guid/Id starts or ends." Neither do we. We don't even know what the string looks like. — Gordon Linoff, Jul 20 '16 at 14:10
Added a few examples of typical strings that would appear in the log — Kull, Jul 20 '16 at 14:25
So you have two distinct and exact patterns? Should be pretty easy to isolate using substring and charindex. — Sean Lange, Jul 20 '16 at 14:38

score 0 · Answer 1 · answered Jul 20 '16 at 15:30

Not really sure what you are trying to accomplish but I suspect that once you figure out how to isolate the key values you are working with you can move forward. This works with your sample data. If you have other patterns this may not work.

create table #Something
(
    SomeString varchar(200)
)

insert #Something (SomeString) values
('ApsisQueue.Synchronize @ApiKey = AA24FA75-CB78-451E-A11F-39115ABF8AA1, @CallbackId = 1')
,('ApsisQueue.Synchronize @ApiKey = AA24FA75-CB78-451E-A11F-39115ABF8AA1, LogMessage = NULL')
,('ApsisQueue.Synchronize @ApiKey = BC2CFC45-CB78-411E-A66F-3CC654AB125, LogMessage = NULL')
,('File id19957 does not exist')
,('File id1955 does not exist')


select *
    , Case when SomeString like 'ApsisQueue%' 
        then SUBSTRING(SomeString, charindex('=', SomeString) + 2, CHARINDEX(',', SomeString) - charindex('=', SomeString) - 2) 
        else left(SUBSTRING(SomeString, 8, 200), charindex(' ', SUBSTRING(SomeString, 8, 200)) - 1)
    end as YourKeyValue
from #Something

drop table #Something

Finding similar strings in tsql

1 Answers1