Microsoft's implementation of lstrcmpi and Unicode characters

Question

I'm trying to understand whether what I'm seeing is a bug or some accepted behaviour of Microsoft's lstrcmpi function.

I can illustrate it with the code:

WCHAR buff1[] = L"abc ";
WCHAR buff2[] = L"abc ";
buff1[3] = 0xFFFF;
buff2[3] = 0x0;
int res = lstrcmpi(buff1, buff2);
// res is 0, i.e. the two strings compare as equal!

EDIT: Addition for the comment below:

@JonathanPotter: Hmm, aren't `buff1` and `buff2` allocated on the stack? — c00000fd, Mar 21 '17 at 01:34
@c00000fd No, not necessarily. Sometimes the compiler will optimize that away unless you specify absolutely no optimizations. — CinchBlue, Mar 21 '17 at 01:44
In this case they are two separate variables in two separate memory areas. This is clearly visible in the disassembly. — 1201ProgramAlarm, Mar 21 '17 at 01:45
@VermillionAzure: OK. Maybe. The question is why `lstrcmpi` returns `0` on those two strings? Does it not see `FFFF` at the end? — c00000fd, Mar 21 '17 at 01:46
@Jonathan, the literals are only being used as initializers for the arrays, he isn't writing into them. — Harry Johnston, Mar 21 '17 at 01:54
http://stackoverflow.com/questions/3482683/can-a-valid-unicode-string-contain-ffff-is-java-characteriterator-broken has some discussion of U+FFFF leading to the general idea that it shouldn't normally appear in strings unless internal logic is using it as a sentinel or such. Perhaps lstrcmpi has some special case for it, but you're not supposed to trip that because you're not supposed to be passing it in at all. — TheUndeadFish, Mar 21 '17 at 02:00
@TheUndeadFish: Like I said in my comment to the answer below, it's not just `FFFF`. I see the same with `FFFE`.... maybe others. All in all, I think it's a dangerous behavior of low-level string comparison APIs. — c00000fd, Mar 21 '17 at 02:03
@c00000fd, this isn't a low-level comparison, it is locale-sensitive. — Harry Johnston, Mar 21 '17 at 02:09
Sounds like you're confusing it with `wcscmp` and/or `_wcsicmp` which will probably behave in the way you desire? — Harry Johnston, Mar 21 '17 at 02:11
@c00000fd - yes, the same with many others, say fa2e, fa2f, fa6e, ... — RbMm, Mar 21 '17 at 02:11
@c00000fd Depends on what you're expecting vs what their goals were. For instance, do you expect L"é" to equal L"e\u0301" (e with combining accent)? Since lstrcmpi does call them equal, I presume its goal is to work more in terms of what humans see rather than byte-for-byte. As such then, I'm assuming non-usable/non-printable characters like U+FFFF and U+FFFE are being treated as irrelevant. Whether that's dangerous or not... is not my call to make. But maybe it's just not intended for your use case? — TheUndeadFish, Mar 21 '17 at 02:13
Yes, [_wcsicmp](https://msdn.microsoft.com/en-us/library/k59z8dwe.aspx) seems to behave as I would expect -- it catches that `FFFF` character. Wow! One would think someone would note this in the MSDN for other APIs. And yes, this can be abused in many ways! — c00000fd, Mar 21 '17 at 02:24
It's behaving as expected. What did you expect to happen and why? — David Heffernan, Mar 21 '17 at 05:48

score 4 · Accepted Answer · edited Jun 20 '20 at 09:12

lstrcmpi calls CompareString with the current locale (from thread or user) and returns "a linguistically appropriate result".

From Michael Kaplan's blog:

... Now if the functions were named lstrcoll and lstrcolli then perhaps the function would not be so commonly misused

and:

Remember that when checking for equality, especially on an item like a registry value where OS semantics are involved, the best answer is CompareStringOrdinal, with a fallback to RtlCompareUnicodeString or even better RtlEqualUnicodeString or if you absolutely must wcsicmp (with awareness that there is one character it can be wrong about) for anything that has to run pre-Vista.

and finally:

Because if you are calling lstrcmpi for appropriate reasons (i.e. you wanted to get linguistically meaningful results, say in the sorting of a list in a user interface) but you wanted to have behavior that did not vary with different locales, then CompareString with LOCALE_INVARIANT is a good answer.

But if you wanted almost anything else, including all of the non-linguistic purposes hinted at earlier, then CompareStringOrdinal or RtlCompareUnicodeString is a much better choice.

How it handles non-characters has actually changed over time.
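
To make the "ordinal" idea from the quotes above concrete, here is a minimal portable sketch of a code-unit-by-code-unit comparison of NUL-terminated UTF-16 strings. The names `ordinal_compare_utf16`, `ordinal_compare_utf16_ascii_nocase` and `fold_ascii` are hypothetical helpers invented for this illustration, not Windows APIs, and the case-insensitive variant folds ASCII letters only, which is an assumption made to keep the example short. On Windows you would normally call CompareStringOrdinal rather than roll your own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Windows API: compares two NUL-terminated
   UTF-16 strings one code unit at a time, with no locale or linguistic
   rules. Returns <0, 0 or >0, in the style of wcscmp. */
int ordinal_compare_utf16(const uint16_t *a, const uint16_t *b)
{
    while (*a != 0 && *a == *b) {
        ++a;
        ++b;
    }
    return (*a > *b) - (*a < *b);
}

/* Assumption for brevity: only ASCII 'A'..'Z' are folded to lower case. */
static uint16_t fold_ascii(uint16_t c)
{
    return (c >= 'A' && c <= 'Z') ? (uint16_t)(c + ('a' - 'A')) : c;
}

/* Same comparison, but ignoring the case of ASCII letters. Every other
   code unit, including 0xFFFF, must match exactly. */
int ordinal_compare_utf16_ascii_nocase(const uint16_t *a, const uint16_t *b)
{
    uint16_t ca, cb;
    do {
        ca = fold_ascii(*a++);
        cb = fold_ascii(*b++);
    } while (ca != 0 && ca == cb);
    return (ca > cb) - (ca < cb);
}
```

With the buffers from the question (one ending in 0xFFFF, the other cut short by 0x0), both functions return a non-zero result, because no code unit is ever treated as ignorable.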

answered Mar 21 '17 at 04:13

Anders

Thank you for the info. Yes, it seems like calling the kernel-mode `RtlCompareUnicodeString` API is the way to go. `CompareStringOrdinal` will not work on XP, if that is important. Regarding `RtlCompareUnicodeString`, though, do I need a DDK to use it in user-mode code? (Apart from linking to it dynamically.) – c00000fd Mar 21 '17 at 04:52
See http://archives.miloush.net/michkap/archive/2006/05/24/605599.html for equality vs sorting and perhaps you can get away with CompareString. For Rtl* you probably need GetProcAddress or a custom import library, it really depends on the SDK though so you could give it a shot and you might get lucky... – Anders Mar 21 '17 at 05:05
Hard to see why you conclude that RtlCompareUnicodeString is the solution, @c. What even is the problem? – David Heffernan Mar 21 '17 at 06:46
@c Questions should not be asked in comments. This function behaves as expected. Your expectations are awry. – David Heffernan Mar 21 '17 at 07:48
@DavidHeffernan: Well then, maybe you should start your own answer and explain yourself... – c00000fd Mar 21 '17 at 08:41
You don't state what your expectations are, and why you feel that this function should meet them. Did you read the documentation yet? – David Heffernan Mar 21 '17 at 08:47
`RtlCompareUnicodeString` is available in both user and kernel mode. You do not need `GetProcAddress` or a *custom* import library; you need to use `ntdll[p].lib` from the *WDK*. As for the declaration, if you do not plan to make wide use of the native API, the easiest way is to declare this single function yourself – RbMm Mar 21 '17 at 09:23
@RbMm: That's what I meant by installing DDK (OK, it's now called WDK, I guess.) Other than that, how do I declare it myself w/o a `.lib` file? – c00000fd Mar 21 '17 at 19:14
@c00000fd - to declare it, the compiler needs the function declaration (you can simply copy/paste it from `wdm.h`) and the linker needs `ntdll[p].lib`. That's all – RbMm Mar 21 '17 at 19:17
@RbMm: Like I said, I don't have DDK installed. – c00000fd Mar 21 '17 at 19:21
Ouch. I'm beginning to see why UNIX just makes everything case-sensitive, it may be a pain for the user but it looks like it avoids a lot of problems ... – Harry Johnston Mar 21 '17 at 20:38

score 2 · Answer 2 · answered Mar 21 '17 at 01:57

The Unicode FFFF character is a noncharacter in the Unicode spec, so it is probably being ignored during the string comparison. This results in both strings being equal.
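
As a rough illustration of that guess, the sketch below tests whether a code point is one of the 66 Unicode noncharacters and then runs a toy comparison that skips such code units before comparing. The function names are hypothetical, and the toy comparison is only a model of "ignore noncharacters": it is not Microsoft's algorithm, it does no case folding, and it only looks at single UTF-16 code units. It also does not account for other code points reported in the comments, such as U+FA2E, which are not noncharacters.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True for the code points Unicode designates as noncharacters:
   U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes
   (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF). */
bool is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;                 /* outside the Unicode code space */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;
    return (cp & 0xFFFE) == 0xFFFE;   /* low 16 bits are FFFE or FFFF */
}

/* Toy model only: advance past BMP noncharacter code units. */
static const uint16_t *skip_ignored(const uint16_t *s)
{
    while (*s != 0 && is_noncharacter(*s))
        ++s;
    return s;
}

/* Hypothetical comparison that treats noncharacters as if they were absent.
   Returns <0, 0 or >0. Not how CompareString is actually implemented. */
int toy_compare_ignoring_noncharacters(const uint16_t *a, const uint16_t *b)
{
    for (;;) {
        a = skip_ignored(a);
        b = skip_ignored(b);
        if (*a != *b || *a == 0)
            return (*a > *b) - (*a < *b);
        ++a;
        ++b;
    }
}
```

Under this model, "abc" followed by 0xFFFF and plain "abc" compare as equal, which matches the result seen in the question.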

1201ProgramAlarm

Yeah, it's an interesting behavior. And it's not just `lstrcmpi`. I see the same behavior in `lstrcmp` and `CompareString` (with and without case sensitivity.) Also with other chars, like `FFFE` for instance. The only one that catches it is C's `memcmp` with a prior `strlen` check. – c00000fd Mar 21 '17 at 02:00
@c00000fd According to Microsoft's documentation, `lstrcmpi` calls `CompareString`. U+FFFE is also a noncharacter. – 1201ProgramAlarm Mar 21 '17 at 02:03
`CompareStringEx()` has several flags that tell the function what kind of characters and differences to ignore, such as diacritics, symbols, case, and the difference between fullwidth and halfwidth. So some combination of those might fit your use case? – Davislor Mar 21 '17 at 02:26
