4

I have noticed that PATINDEX on SQL Server (I'm using 2016) is giving odd results. I suspect it has something to do with collations and character sets.

I'm trying to get the index of the first space or hyphen using PATINDEX.

On a database with the default collation SQL_Latin1_General_CP1_CI_AS I get the expected results of 0 (not found):

select PATINDEX('%[ -]%',  'ABC') -- returns 0
select PATINDEX('%[ -]%', N'ABC') -- returns 0
select PATINDEX('%[- ]%', N'ABC') -- returns 0

However, on a database with collation Latin1_General_100_BIN I get an unexpected and wrong result indicating that either a space or hyphen was found at index 1:

select PATINDEX('%[ -]%',  'ABC') -- returns 0
select PATINDEX('%[ -]%', N'ABC') -- returns 1 (WRONG!)
select PATINDEX('%[- ]%', N'ABC') -- returns 0

In summary, I note this apparently wrong result when:

  • Collation Latin1_General_100_BIN
  • Searching a unicode string
  • Hyphen in pattern appears last.

I have seen other similar questions but they don't address quite the same situation, especially why the patteren works on one collation and not other, and OK on non-unicode string and not on unicode string. I have seen patindex t-sql special characters which says that the '-' character in any position other than first is a range spec for LIKE and PATINDEX (although I don't see it in the SQL Server PATINDEX or Wildcard docs). Still doesn't explain why it works in some configurations and not others as shown.

Why such different PATINDEX and apparently wrong result?

John
  • 173
  • 3
  • 13
  • 1
    I did report this as a bug in 2012. It was closed as "won't fix". Original link http://connect.microsoft.com/SQLServer/feedback/details/742841/pattern-syntax-undocumented-difference-in-behaviour-between-varchar-vs-nvarchar but doesn't look like this exists in the way back machine – Martin Smith Apr 17 '19 at 12:21
  • 1
    Response was *Greetings from the SQL Server team. Thanks for writing in to Microsoft. We greatly value your feedback. We were able to reproduce the bug that you mentioned. However all the patterns mentioned in the repro are invalid because SQL Server treats '-' as a special character and expects a start and end range. Ideally these invalid patterns should return NULL but the current behavior does not do so. We shall consider modifying the behavior for future release. Thanks again for providing feedback and making SQL Server the greatest Database server. Regards Atish Agrawal* – Martin Smith Apr 17 '19 at 12:25
  • I'd consider the chance that the behavior *actually* gets modified in a future release to be very close to zero (especially for it to ever return `NULL`!), because it could break existing systems that are counting on exactly this unintuitive behavior to happen, whereas the workarounds for it are fairly simple (if not necessarily obvious to get fully correct). Just in case the fact that nothing has happened since 2012 wasn't already a good indicator to not hold your breath. – Jeroen Mostert Apr 17 '19 at 12:40
  • Yep - could do with implementing proper regular expression support as a native function so we can just forget about `PATINDEX` – Martin Smith Apr 17 '19 at 12:48
  • Thank you everyone for your responses, they are much appreciated. Of all these I consider Martin Smith's comments the most accurate as they show Microsoft recognize this as a bug. i.e. The hyphen is a special character and not handled as an individual char for search. Therefore, all such uses of hyphen in PATINDEX should return an error rather than often coincidental results that they do. – John Apr 18 '19 at 10:23

2 Answers2

3

You already mentioned yourself that the '-' character in any position other than first is (or rather can be) a range spec. The issue with ' -' is that no end of the range is given. So lets find out what the end of the range is:

SELECT  SV.number, NCHAR(SV.number) TestChar
FROM    master..spt_values AS SV
WHERE   SV.type = 'p'
    AND NCHAR(SV.number) LIKE N'%[ -]%'

This returns:

+--------+----------+
| number | TestChar |
+--------+----------+
|     32 |          |
|     45 |        - |
+--------+----------+

So, in a non-binary collation (I'm using Latin1_General_CI_AS) the - is not interpreted as a range specifier at all, but as a literal character. Otherwise, the characters between 32 and 45 would also be returned. So only space and dash will return a patindex <> 0. Lets try a binary collation:

SELECT  SV.number, NCHAR(SV.number) TestChar 
FROM    master..spt_values AS SV
WHERE   SV.type = 'p'
    AND NCHAR(SV.number) LIKE N'%[ -]%' COLLATE Latin1_General_100_BIN2

This returns:

+--------+----------+
| number | TestChar |
+--------+----------+
|     32 |          |
|     33 | !        |
|     34 | "        |
|     35 | #        |
|     36 | $        |
|     37 | %        |
|     38 | &        |
|     39 | '        |
|     40 | (        |
|     41 | )        |
|     42 | *        |
|     43 | +        |
|     44 | ,        |
|     45 | -        |
|     46 | .        |
|     47 | /        |
|     48 | 0        |
|     49 | 1        |
|     50 | 2        |
|     51 | 3        |
|     52 | 4        |
|     53 | 5        |
|     54 | 6        |
|     55 | 7        |
|     56 | 8        |
|     57 | 9        |
|     58 | :        |
|     59 | ;        |
|     60 | <        |
|     61 | =        |
|     62 | >        |
|     63 | ?        |
|     64 | @        |
|     65 | A        |
|     66 | B        |
|     67 | C        |
|     68 | D        |
|     69 | E        |
|     70 | F        |
|     71 | G        |
|     72 | H        |
|     73 | I        |
|     74 | J        |
|     75 | K        |
|     76 | L        |
|     77 | M        |
|     78 | N        |
|     79 | O        |
|     80 | P        |
|     81 | Q        |
|     82 | R        |
|     83 | S        |
|     84 | T        |
|     85 | U        |
|     86 | V        |
|     87 | W        |
|     88 | X        |
|     89 | Y        |
|     90 | Z        |
|     91 | [        |
|     92 | \        |
|     93 | ]        |
+--------+----------+

So now it is interpreted as a range, and the range includes A-Z. Note it does not contain a-z! The lower case letters would be included when we use LIKE N'%[ -z]%'. In binary, the end of the range (when none is specified) is always ], no matter what the beginning of the range is.

Now, lets have a look at what non-unicode values do:

SELECT  SV.number, CHAR(SV.number) TestChar
FROM    master..spt_values AS SV
WHERE   SV.type = 'p'
    AND CHAR(SV.number) LIKE '%[ -]%' COLLATE Latin1_General_100_BIN2

This returns:

+--------+----------+
| number | TestChar |
+--------+----------+
|     32 |          |
|     45 |        - |
+--------+----------+

So, as ASCII the dash is again not interpreted as a range operator. Weird, huh?

By the way, if you really want to search for space dash, you can also use PATINDEX(N'% [-]%', N'ABC' COLLATE Latin1_General_BIN2).

Another by the way: If we check out Larnu's solution:

SELECT  SV.number, NCHAR(SV.number) TestChar
FROM    master..spt_values AS SV
WHERE   SV.type = 'p'
    AND CHAR(SV.number) LIKE '%[ --]%' COLLATE Latin1_General_100_BIN2

You'll get:

+--------+----------+
| number | TestChar |
+--------+----------+
|     32 |          |
|     33 | !        |
|     34 | "        |
|     35 | #        |
|     36 | $        |
|     37 | %        |
|     38 | &        |
|     39 | '        |
|     40 | (        |
|     41 | )        |
|     42 | *        |
|     43 | +        |
|     44 | ,        |
|     45 | -        |
+--------+----------+

So you're still evaluating a range. Not sure if that's what you want, but something to be aware of.

HoneyBadger
  • 14,750
  • 3
  • 34
  • 48
0

Double up the hyphen as it seems it's sometimes being used as a between operator.

SELECT PATINDEX(N'%[ --]%', 'ABC' COLLATE Latin1_General_100_BIN); --Returns 0
SELECT PATINDEX(N'%[ --]%', N'ABC' COLLATE Latin1_General_100_BIN); --Returns 0
SELECT PATINDEX(N'%[-- ]%', N'ABC' COLLATE Latin1_General_100_BIN); --Returns 0

SELECT PATINDEX(N'%[ --]%', '-ABC' COLLATE Latin1_General_100_BIN); --Returns 1
SELECT PATINDEX(N'%[ --]%', N'ABC-' COLLATE Latin1_General_100_BIN); --Returns 4
SELECT PATINDEX(N'%[-- ]%', N'-ABC' COLLATE Latin1_General_100_BIN); --Returns 0, as the hyphen is at the start, so doesn't need escaping.
Thom A
  • 88,727
  • 11
  • 45
  • 75
  • Thanks for the observation and possible workaround. It still doesn't explain why PATINDEX isn't behaving as it should. As for hyphen being a range operator, as a regex pattern in the context shown it should not be interpreted that way. https://regex101.com/ is a nice place to try these out. – John Apr 17 '19 at 08:07
  • @John `PATINDEX` doesn't support REGEX, so REGEX standards don't apply. SQL Server uses in own pattern matching formula, which although has a similar format to REGEX, it's not. The behaviour between the 2 should not be expected to be the same. – Thom A Apr 17 '19 at 08:10
  • thanks for poitning this out. Still, I need to know why the different behavior in PATINDEX according to different configurations. I write code for a software vendor whose customers use across all kinds of configs. I don't want to be awake at night wondering if/when something like this will break. Currently considering old-school alternatives to PATINDEX. – John Apr 17 '19 at 08:17
  • @John: FWIW, `LIKE` admits an explicit escape character that circumvent this problem -- `WHERE ... LIKE '%[ \-]%' ESCAPE '\'` has the expected semantics regardless of collation. Unfortunately `PATINDEX` is not so blessed, but you could combine them if rows that don't include the pattern will ultimately not be used in the result. – Jeroen Mostert Apr 17 '19 at 10:19
  • 1
    This doesn't just find the first space or hyphen though, it'll also find 12 other characters (I mentioned them in my answer). – HoneyBadger Apr 17 '19 at 10:49