
So as not to reinvent the wheel, I'll refer to the existing question "Cyrillic characters in PHP's json_encode".

The question is: what are those characters, and what do they mean: \u0435, \u0434 and so on? I guess they have nothing to do with the number of bytes. Is each one just a serial number in UTF-8 that corresponds to the Cyrillic symbols "е", "д" and so on, respectively?

Vadim Samokhin

1 Answer


These are Unicode escape sequences that reference characters in the Unicode character set by denoting their code points in hexadecimal.
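To make the distinction between a code point and its UTF-8 bytes concrete, here is a minimal sketch. It is written in Python rather than PHP purely for illustration, and the variable names are my own. It shows that the number after \u is the character's position in Unicode, not its UTF-8 encoding:

```python
# Compare each letter's Unicode code point with its UTF-8 byte encoding.
letters = ["\u0435", "\u0434"]  # the Cyrillic letters е and д, written as Python escapes

summary = {}
for ch in letters:
    code_point = ord(ch)             # position in the Unicode character set
    utf8_bytes = ch.encode("utf-8")  # the bytes UTF-8 uses to store that position
    summary[ch] = (f"U+{code_point:04X}", utf8_bytes.hex())
    print(ch, *summary[ch])
```

The hex digits in the escape match the code point (0435, 0434), while the UTF-8 form of each letter is a different two-byte sequence.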

From the JSON specification:

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
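The six-character form described in that passage can be built by hand. The following is only a sketch in Python; `json_escape_bmp` is a hypothetical helper name, not part of any library, and it deliberately rejects anything outside the Basic Multilingual Plane:

```python
def json_escape_bmp(ch: str) -> str:
    """Return the six-character JSON escape for a single BMP character."""
    code_point = ord(ch)
    if code_point > 0xFFFF:
        raise ValueError("character is outside the Basic Multilingual Plane")
    # Reverse solidus, lowercase u, then four hexadecimal digits.
    return "\\u{:04x}".format(code_point)


print(json_escape_bmp("\\"))      # the reverse solidus itself
print(json_escape_bmp("\u0435"))  # Cyrillic е
```

This sketch emits lowercase hexadecimal digits; per the quoted passage, uppercase would be equally valid.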

Although these characters do not need to be escaped (see the unescaped rule), json_encode by default escapes every character that is not also in US-ASCII (see the source of json.c) to avoid encoding issues with US-ASCII-based protocols.
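The same default can be observed in other JSON encoders. As an illustration only, here it is in Python rather than PHP: `json.dumps` escapes non-ASCII characters unless `ensure_ascii=False` is passed. That is roughly analogous to passing the `JSON_UNESCAPED_UNICODE` flag to json_encode in PHP 5.4 and later:

```python
import json

text = "\u0435\u0434"  # the two Cyrillic letters е and д

escaped = json.dumps(text)                      # default: output is pure US-ASCII
literal = json.dumps(text, ensure_ascii=False)  # non-ASCII characters kept as-is

print(escaped)
print(literal)
```

Both outputs are valid JSON for the same string. The escaped form just survives transports that only handle US-ASCII.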

So inside a JSON string, \u0435 references the character at U+0435, which is CYRILLIC SMALL LETTER IE (е), and \u0434 references the character at U+0434, which is CYRILLIC SMALL LETTER DE (д).
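A quick way to check this mapping is to decode the escapes and look up the official character names. This is a small sketch in Python (an illustration, not PHP), using only the standard `json` and `unicodedata` modules:

```python
import json
import unicodedata

# A JSON document containing only the two escape sequences.
decoded = json.loads('"\\u0435\\u0434"')

# Look up the Unicode name of each decoded character.
names = [unicodedata.name(ch) for ch in decoded]

for ch, name in zip(decoded, names):
    print(f"U+{ord(ch):04X} {name}")
```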

Gumbo