Is it safe to use `strstr` to search for multibyte UTF-8 characters in a string?

Question

Following up on my previous question, "Why `strchr` seems to work with multibyte characters, despite man page disclaimer?", I figured out that `strchr` was a bad choice.

Instead, I am thinking about using `strstr` to look for a single character (a multi-byte character, not a `char`):

const char str[] = "This string contains é which is a multi-byte character";
const char *pos = strstr(str, "é"); // "é" is 0xC3 0xA9 in UTF-8: 2 bytes
if (pos != NULL) // strstr returns NULL when there is no match
    printf("%s\n", pos);

Output:

é which is a multi-byte character

Which is what I expect: the position of the 1st byte of my multi-byte character.

A priori, this is not the canonical use of strstr but it seems to work well.
Is this workaround safe? Can you think of any side effect or special case that would cause a bug?

[EDIT]: I should clarify that I do not want to use the `wchar_t` type and that the strings I handle are UTF-8 encoded (I am aware this choice can be discussed, but that debate is irrelevant here).

_"A priori, this is not the canonical"_ Isn't it? It's just a UTF8 encoded string. — Adriano Repetti, Aug 29 '14 at 15:41
It depends on how "normal" your implementation is. The locale-specific multibyte encoding could be UTF-7 (or anything else having state), in which case `strstr` may yield false positives. — mafso, Aug 29 '14 at 15:47
I should ask: why not use proper UTF-8 functions instead? — Jack, Aug 29 '14 at 15:49
@Jack If you are talking about `wchar` I got the same question in my previous post ^^ : see [why](http://stackoverflow.com/questions/25566356/why-strchr-seems-to-work-with-multibyte-characters-despite-man-page-disclaime#comment39933408_25566356) — n0p, Aug 29 '14 at 16:01
possible duplicate of [Does encoding affect the result of strstr() (and related functions)](http://stackoverflow.com/questions/8209466/does-encoding-affect-the-result-of-strstr-and-related-functions) — Ross Ridge, Aug 29 '14 at 16:09
The aspect of UTF-8 should be in the post title and tag as it is a key point in the post. — chux - Reinstate Monica, Aug 29 '14 at 17:12
check [C's strstr function, for instance, will work perfectly as long as both its inputs are valid, null-terminated UTF-8 strings](https://stackoverflow.com/a/313596/5407848) — Accountant م, Apr 07 '19 at 20:51

score 8 · Accepted Answer · edited Apr 07 '19 at 16:11

Edit
Based on the OP's updated question ("can such a false positive exist in a UTF-8 context?"): no. UTF-8 is designed so that the partial character mismatch shown below cannot occur, because lead bytes and continuation bytes use disjoint ranges of values, so a match can never start in the middle of a character. As long as both strings are valid UTF-8, it is completely safe to use strstr with UTF-8 encoded multibyte characters.
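
To make the point above concrete, here is a minimal sketch in C. The helper names `is_utf8_continuation` and `utf8_find` are made up for this illustration, and both inputs are assumed to be valid, null-terminated UTF-8. The second helper is just `strstr` reporting a byte offset; the first shows the bit pattern that keeps continuation bytes apart from lead bytes.

```c
#include <string.h>

/* Returns 1 if b is a UTF-8 continuation byte (bit pattern 10xxxxxx).
 * Lead bytes and ASCII bytes never have this pattern. */
static int is_utf8_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: byte offset of the first occurrence of needle in
 * haystack, or -1 if there is none. Assumes both are valid UTF-8. */
static long utf8_find(const char *haystack, const char *needle)
{
    const char *pos = strstr(haystack, needle);
    return pos ? (long)(pos - haystack) : -1;
}
```

For example, searching the UTF-8 bytes of "café au lait" for "é" (0xC3 0xA9) gives byte offset 3, while searching "é" for "©" (0xC2 0xA9) finds nothing even though both end in the byte 0xA9: the needle starts with a lead byte, and a lead byte can only match another lead byte.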

Original Answer
No, strstr is not suitable for strings containing multi-byte characters in general.

If you search for a string that contains no multi-byte characters inside a string that does contain them, you may get a false positive. (For example, with Shift_JIS encoding in a Japanese locale, strstr("掘something", "@some") may give a false positive.)

+---------+----+----+----+
|   c1    | c2 | c3 | c4 |  <--- string
+---------+----+----+----+

     +----+----+----+
     | c5 | c2 | c3 |  <--- string to search
     +----+----+----+

If the trailing part of c1 (accidentally) matches c5, you may get an incorrect result. I would suggest either using Unicode with a Unicode-aware substring function or using a multibyte substring function (_mbsstr, for example).
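
The Shift_JIS case can be reproduced without a Japanese locale by spelling out the bytes. This is only a sketch: it assumes, as the example above implies, that Shift_JIS encodes 掘 as the two bytes 0x8C 0x40, the second of which is also the ASCII code for '@'. The names `find_offset` and `sjis_text` are invented for the illustration.

```c
#include <string.h>

/* Hypothetical helper: byte offset of the first strstr match, or -1. */
static long find_offset(const char *haystack, const char *needle)
{
    const char *pos = strstr(haystack, needle);
    return pos ? (long)(pos - haystack) : -1;
}

/* Assumed Shift_JIS bytes for "掘something": 0x8C 0x40, then plain ASCII.
 * The literal is split in two so that 's' is not read as part of the
 * hex escape. */
static const char sjis_text[] = "\x8C\x40" "something";
```

Under that assumption, `find_offset(sjis_text, "@some")` reports a match at byte offset 1, in the middle of the two-byte character, even though the text contains no '@' character at all. This is the c1/c5 overlap from the diagram.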

edited Apr 07 '19 at 16:11

Accountant م

answered Aug 29 '14 at 15:49

Mohit Jain

Thanks, I had the intuition I was missing something. But now the question is: can such a false positive exist in a UTF-8 context? – n0p Aug 29 '14 at 16:03
You can't get false positives with UTF-8 because the initial byte of a character is always different from any of the possible subsequent bytes. – Ross Ridge Aug 29 '14 at 16:09
As Ross has already mentioned, using strstr with UTF-8 is completely safe. UTF-8 codes are constructed in such a way that false positives are not possible between characters of the UTF-8 character set. – Mohit Jain Aug 29 '14 at 16:11
Based on OP's additional information that strings are UTF-8, this answer is wrong. It should at least be updated with additional information to make it clear that `strstr` is perfectly safe for OP's usage and that the concerns in the answer only apply to legacy encodings like Shift_JIS. – R.. GitHub STOP HELPING ICE Aug 29 '14 at 16:42
Helpful answer! What about strcmp? – Virus721 May 18 '15 at 13:59
@virus721 `strcmp` is absolutely fine as long as (1) no character contains a 0 byte and (2) both strings use the same encoding. – Mohit Jain May 20 '15 at 05:08

score 1 · Answer 2 · answered Aug 29 '14 at 16:03

Modern systems use UTF-8 (or ASCII) as their multibyte encoding, where the use of this function is safe.

To be strictly conforming and make your code work even on old/exotic platforms, you need to take additional problems into account.

First, the good news: In every multibyte encoding, a 0-byte indicates the end of a string, regardless of state. This means your strstr call won’t run past the end of the string or crash, but the result may be wrong.

As an example, consider UTF-7, a 7-bit clean way to encode Unicode. UTF-7 is a multibyte encoding with a shift state, which means that how a byte is interpreted may depend on the context in which it appears. E.g. (cf. Wikipedia) “£AKM” is encoded as +AKM-AKM in UTF-7, where the + sign changes the state and the interpretation of letters like A. Calling strstr(str, "AKM") would match the first AKM portion (right after the +), although that portion is part of the encoding of £. The correct match is the AKM portion after the -, which sets the shift state back to the initial state.
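
The UTF-7 case can be sketched in C as follows. Both function names are hypothetical, and `utf7_find_direct` is a deliberately simplified, illustration-only state tracker: it assumes every shifted sequence starts with '+' and ends with '-', and that the needle consists only of directly encoded ASCII characters other than '+'. Real UTF-7 has more rules than this.

```c
#include <stddef.h>
#include <string.h>

/* Plain strstr, reporting a byte offset or -1. It ignores shift state. */
static long naive_find(const char *haystack, const char *needle)
{
    const char *pos = strstr(haystack, needle);
    return pos ? (long)(pos - haystack) : -1;
}

/* Simplified shift-aware search: skips everything between '+' and '-'
 * and only compares against the needle in the initial (unshifted) state. */
static long utf7_find_direct(const char *haystack, const char *needle)
{
    size_t n = strlen(needle);
    int shifted = 0;

    for (size_t i = 0; haystack[i] != '\0'; i++) {
        if (shifted) {
            if (haystack[i] == '-')
                shifted = 0;            /* back to the initial state */
            continue;
        }
        if (haystack[i] == '+') {
            shifted = 1;                /* start of a shifted sequence */
            continue;
        }
        if (strncmp(haystack + i, needle, n) == 0)
            return (long)i;
    }
    return -1;
}
```

On the string `+AKM-AKM`, `naive_find` with the needle `AKM` reports offset 1 (inside the encoding of £), whereas the shift-aware sketch reports offset 5, the literal letters after the `-`.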

I forgot to specify that I use UTF-8 encoding, but thanks for the tips anyway. — n0p, Aug 29 '14 at 18:15
