Oracle REGEXP_INSTR() and "a-z" character range doesn't match as expected

Question

I want to use REGEXP_INSTR() within an oracle database to check for lower/uppercase characters. I'm aware of [:upper:] and [:lower:] POSIX character classes, but I went with a-z that gives me really weird results I don't understand. Can someone explain this?

SELECT REGEXP_INSTR('abc','[A-Z]',1,1,0,'c') FROM DUAL
-- Got 2, expected 0

SELECT REGEXP_INSTR('zyx','[A-Z]',1,1,0,'c') FROM DUAL
-- Got 1, expected 0

SELECT REGEXP_INSTR('ABC','[a-z]',1,1,0,'c') FROM DUAL
-- Got 1, expected 0

SELECT REGEXP_INSTR('ZYX','[a-z]',1,1,0,'c') FROM DUAL
-- Got 2, expected 0

SELECT REGEXP_INSTR('a3','[A-F0-9]',1,1,0,'c') FROM DUAL
-- Got 2, expected 2

SELECT REGEXP_INSTR('b3','[A-F0-9]',1,1,0,'c') FROM DUAL
-- Got 1, expected 2

SELECT REGEXP_INSTR('b3','[A-F0-9]') FROM DUAL
-- Got 1, expected 1 or 2

SELECT REGEXP_INSTR('a3','[A-F0-9]') FROM DUAL
-- Got 2, expected same as above

*`'c': Turn on case sensitive matching.`* So `REGEXP_INSTR('zyx','[A-Z]',1,1,0,'c')` will match `z`, for example, and you can't expect `0`. Mind that *The first character in the string has position 1.* See https://www.regular-expressions.info/oracle.html — Wiktor Stribiżew, Sep 18 '19 at 07:04
Yes, 'c' turns on case sensitive matching. 'z' is lowercase. I expect '[A-Z]' to match uppercase characters from A to Z. Why would it match z? — xsrf, Sep 18 '19 at 07:08
Please check [this post](https://stackoverflow.com/a/5391234/3832970), check if `NLS_COMP` and `NLS_SORT` session parameters interfere with regex case sensitive matching. — Wiktor Stribiżew, Sep 18 '19 at 07:15
Lets assume `c` made `[A-Z]` equal to `[A-Za-z]`, which I doubt, because that would be case-IN-sensitive, why wouldn't then the first query return `1`? — xsrf, Sep 18 '19 at 07:15
@WiktorStribiżew I think is not quite correct. why is the result in the first example then 2 and not 1? — hotfix, Sep 18 '19 at 07:20
I understand that `NLS_COMP` and `NLS_SORT` may have an impact here regarding case sensitivity. What I really don't understand at all are the last two queries. How on earth can they give different results? If `b` matches `A-F`, why wouldn't `a`? — xsrf, Sep 18 '19 at 07:20
okay wait, it makes sense if the order is `aAbBcC` and `A-F` actually expands to `AbBcCdDeEfF` and `a-f` expands to `aAbBcCdDeEf`... wow — xsrf, Sep 18 '19 at 07:25
Looks like the collation effect. Try setting `NLS_SORT` as `BINARY` — Wiktor Stribiżew, Sep 18 '19 at 07:25
@xsrf if you set `NLS_SORT = binary` you will get your expected result — hotfix, Sep 18 '19 at 07:33

score 2 · Answer 1 · answered Sep 18 '19 at 07:41

The reason for the behavior is the collation rules. See the NLS_SORT documentation:

If the value is BINARY, then the collating sequence for ORDER BY queries is based on the numeric value of characters (a binary sort that requires less system overhead).

If the value is a named linguistic sort, sorting is based on the order of the defined linguistic sort. Most (but not all) languages supported by the NLS_LANGUAGE parameter also support a linguistic sort with the same name.

Set the NLS_SORT to BINARY so that the [A-Z] could be parsed in the same order as in the ASCII table,

alter session set nls_sort = 'BINARY'

Then, you will get consistent results.

See the online demo.

score -1 · Answer 2 · answered Sep 18 '19 at 17:01

Okay, the answer that NLS_SORT causes this behavior is correct, but I don't think it explains it in an understandable way. None of the documentation I found actually does that...

You have to imagine that the character ranges defined by [a-z] are actually derived from a single substring of all possible characters which are sorted depending on NLS_SORT.

Lets assume the whole alphabet is just alphanumerical characters. Sorted by BINARY this results in a base string like 0123456789abcdefgh...xyzABCDE...XYZ. Derived from this, [0-6] expands to [0123456], [a-f] to [abcdef], [5-b] to [56789ab] etc.

Sorted by a linguistic_definition however results in a different base string, like 0123456789aAbBcCdDeF...xXyYzZ. Derived from this, [0-6] still expands to [0123456], but [a-f] now expands to [aAbBcCdDeEf] and [5-b] to [56789aAb] etc...

This is why a did not match [A-Z], but b did. [A-Z] actually expands to [AbBcC...yYzZ] which includes z but not a.

In reality [A-Z] might even contain more characters, like [aAàáâÀÁÂ...] etc.

Oracle REGEXP_INSTR() and "a-z" character range doesn't match as expected

2 Answers2

Linked