2

I'm looking for a regular expression in MSSQL to omit values which 'start' with specific words. For eg., the expression should ignore any string starting with 'Mr' but pull any string which starts with 'MMR' or where 'Mr' is somewhere in the middle of the string:

Sample data:

Should be omitted: 'Mr Murphy Barrera 7329 Neque Road Zierikzee M15 6SF Tonga'

Should not be omitted: 'Murphy Barrera 7329 Neque Road Zierikzee M15 6SF Mr Tonga'

I used LIKE '[^Mr.]' but that omits everything which starts with M OR R, so it's not doing the trick.

I have a list of such words which need to be checked (eg. Ms, Dr, St etc), so I'd request you let me know how to tweak the regex.

Thanks in advance!

Community
  • 1
  • 1
Chipmunk_da
  • 467
  • 2
  • 9
  • 27
  • [ ] says one of character between them, so [^Mr] will ommits M or r. Try to use ( ) – Michał M Oct 05 '16 at 09:54
  • Hi @Michal, I tried that and {} as well, but these don't work either. – Chipmunk_da Oct 05 '16 at 09:59
  • 1
    SQL Server doesn't have regular expressions. You should probably use full text search if you want acceptable performance, although the best solution would be to extract the prefix during loading. It's far easier to use a Regex in the C# program or an SSIS flow that loads the data into the table – Panagiotis Kanavos Oct 05 '16 at 10:02
  • Thanks @Panagiotis, I completely agree with you on this, languages like C# and Python are much better at handling this. But since I need a relatively quick solution and since the data is already sitting in a static SQL table (of ~200k rows), I think it's best to use a simple SQL script for now. – Chipmunk_da Oct 05 '16 at 10:40
  • 2
    It's not a matter of language. The best place to perform text processing is during loading. The stored data should be ready for querying. That's especially true when you have a large number of rows – Panagiotis Kanavos Oct 05 '16 at 10:49

2 Answers2

4

Why don't You just use

string not like 'mr%' ?

It will ignore the queries, starting with Mr.

  • 2
    Thanks @Tarlan, sometimes the mind tries to complicate things so much that it completely ignores the fact that there could be simpler ways of doing things! Thanks for making me realise that, I've just used your suggestion for now as a quick fix. Perhaps later, to make the query more efficient, I'll use the contain clause. – Chipmunk_da Oct 05 '16 at 10:36
1
(string like 'mmr%' or string like '%mr%') and string not like 'mr%'

Query using total wild cards can also use statistics,there by good estimates as well,only

1.If the total length of string is less than 80 characters

If string is longer than 80 characters then,

the first and last 40 characters are extracted from the string and concatenated prior to considering the string in the creation of the string summary. Hence, accurate frequency estimates for substrings that appear only in the ignored portion of a string are not available

I have a list of such words which need to be checked (eg. Ms, Dr, St etc), so I'd request you let me know how to tweak the regex

I recommend using full text search.To know why read below

Performance of like '%Query%' vs full text search CONTAINS query

Community
  • 1
  • 1
TheGameiswar
  • 27,855
  • 8
  • 56
  • 94
  • 1
    The performance of such a query will be horrible, as it will have to process all records. – Panagiotis Kanavos Oct 05 '16 at 10:00
  • Thanks @TheGameiswar, this is a helpful suggest. Though I only have to look through 200k rows of a 140 character field, it's always good to know how to do things better! – Chipmunk_da Oct 05 '16 at 10:31
  • @TheGameiswar @Panagiotis: Plus I'm actually after some thing like the following and I don't think I could use a full text search CONTAINS query for this: `(T.BeneAddress LIKE M.MatchIdentifier + '[^A-Za-z0-9]%' and T.BeneAddress not like 'Mr%' and T.BeneAddress not like 'Ms%') `-- _Match any addr which contains a word that starts with the MatchIdentifier and then ends with a special character (i.e. Country matchidentifier is at strt of Addr). BUT Addr shud not start with Mr/Ms - to avoid false positives since there are country matchidentifiers that are 'MR' for Mauritiana/'MS' Montserrat etc_ – Chipmunk_da Oct 05 '16 at 11:02