3

I want to use REGEXP_INSTR() within an oracle database to check for lower/uppercase characters. I'm aware of [:upper:] and [:lower:] POSIX character classes, but I went with a-z that gives me really weird results I don't understand. Can someone explain this?

SELECT REGEXP_INSTR('abc','[A-Z]',1,1,0,'c') FROM DUAL
-- Got 2, expected 0

SELECT REGEXP_INSTR('zyx','[A-Z]',1,1,0,'c') FROM DUAL
-- Got 1, expected 0

SELECT REGEXP_INSTR('ABC','[a-z]',1,1,0,'c') FROM DUAL
-- Got 1, expected 0

SELECT REGEXP_INSTR('ZYX','[a-z]',1,1,0,'c') FROM DUAL
-- Got 2, expected 0

SELECT REGEXP_INSTR('a3','[A-F0-9]',1,1,0,'c') FROM DUAL
-- Got 2, expected 2

SELECT REGEXP_INSTR('b3','[A-F0-9]',1,1,0,'c') FROM DUAL
-- Got 1, expected 2

SELECT REGEXP_INSTR('b3','[A-F0-9]') FROM DUAL
-- Got 1, expected 1 or 2

SELECT REGEXP_INSTR('a3','[A-F0-9]') FROM DUAL
-- Got 2, expected same as above
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
xsrf
  • 564
  • 5
  • 18
  • 1
    *`'c': Turn on case sensitive matching.`* So `REGEXP_INSTR('zyx','[A-Z]',1,1,0,'c')` will match `z`, for example, and you can't expect `0`. Mind that *The first character in the string has position 1.* See https://www.regular-expressions.info/oracle.html – Wiktor Stribiżew Sep 18 '19 at 07:04
  • Yes, 'c' turns on case sensitive matching. 'z' is lowercase. I expect '[A-Z]' to match uppercase characters from A to Z. Why would it match z? – xsrf Sep 18 '19 at 07:08
  • Please check [this post](https://stackoverflow.com/a/5391234/3832970), check if `NLS_COMP` and `NLS_SORT` session parameters interfere with regex case sensitive matching. – Wiktor Stribiżew Sep 18 '19 at 07:15
  • Lets assume `c` made `[A-Z]` equal to `[A-Za-z]`, which I doubt, because that would be case-IN-sensitive, why wouldn't then the first query return `1`? – xsrf Sep 18 '19 at 07:15
  • @WiktorStribiżew I think is not quite correct. why is the result in the first example then 2 and not 1? – hotfix Sep 18 '19 at 07:20
  • I understand that `NLS_COMP` and `NLS_SORT` may have an impact here regarding case sensitivity. What I really don't understand at all are the last two queries. How on earth can they give different results? If `b` matches `A-F`, why wouldn't `a`? – xsrf Sep 18 '19 at 07:20
  • 1
    okay wait, it makes sense if the order is `aAbBcC` and `A-F` actually expands to `AbBcCdDeEfF` and `a-f` expands to `aAbBcCdDeEf`... wow – xsrf Sep 18 '19 at 07:25
  • Looks like the collation effect. Try setting `NLS_SORT` as `BINARY` – Wiktor Stribiżew Sep 18 '19 at 07:25
  • 1
    @xsrf if you set `NLS_SORT = binary` you will get your expected result – hotfix Sep 18 '19 at 07:33

2 Answers2

2

The reason for the behavior is the collation rules. See the NLS_SORT documentation:

  • If the value is BINARY, then the collating sequence for ORDER BY queries is based on the numeric value of characters (a binary sort that requires less system overhead).
  • If the value is a named linguistic sort, sorting is based on the order of the defined linguistic sort. Most (but not all) languages supported by the NLS_LANGUAGE parameter also support a linguistic sort with the same name.

Set the NLS_SORT to BINARY so that the [A-Z] could be parsed in the same order as in the ASCII table,

alter session set nls_sort = 'BINARY'

Then, you will get consistent results.

See the online demo.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
-1

Okay, the answer that NLS_SORT causes this behavior is correct, but I don't think it explains it in an understandable way. None of the documentation I found actually does that...

You have to imagine that the character ranges defined by [a-z] are actually derived from a single substring of all possible characters which are sorted depending on NLS_SORT.

Lets assume the whole alphabet is just alphanumerical characters. Sorted by BINARY this results in a base string like 0123456789abcdefgh...xyzABCDE...XYZ. Derived from this, [0-6] expands to [0123456], [a-f] to [abcdef], [5-b] to [56789ab] etc.

Sorted by a linguistic_definition however results in a different base string, like 0123456789aAbBcCdDeF...xXyYzZ. Derived from this, [0-6] still expands to [0123456], but [a-f] now expands to [aAbBcCdDeEf] and [5-b] to [56789aAb] etc...

This is why a did not match [A-Z], but b did. [A-Z] actually expands to [AbBcC...yYzZ] which includes z but not a.

In reality [A-Z] might even contain more characters, like [aAàáâÀÁÂ...] etc.

xsrf
  • 564
  • 5
  • 18