Questions tagged [multibyte]

PHP (mbstring) provides multibyte-specific string functions that help you deal with multibyte encodings.

While many languages can represent every necessary character with a one-to-one mapping to an 8-bit value, several others require so many characters for written communication that they cannot fit within the range a single byte can encode. A byte is made up of eight bits, and each bit can hold only two distinct values, one or zero, so a byte can represent only 256 unique values (two to the power of eight). Multibyte character encoding schemes were developed to express more than 256 characters within the regular bytewise coding system.
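The arithmetic above can be checked with a small sketch. It uses Python rather than PHP purely for illustration, and the sample characters and the helper name `utf8_length` are assumptions chosen for this example, not part of mbstring. It shows that UTF-8, one common multibyte encoding, spends between one and four bytes per character.

```python
# Number of distinct values a single 8-bit byte can hold.
BYTE_VALUES = 2 ** 8  # 256


def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8 (hypothetical helper)."""
    return len(ch.encode("utf-8"))


# Sample characters written as escapes so the source file encoding does not matter:
# Latin "A", Latin "e" with acute accent, a CJK ideograph, and an emoji.
samples = ["A", "\u00e9", "\u65e5", "\U0001F600"]
byte_lengths = {ch: utf8_length(ch) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X} needs {n} byte(s) in UTF-8")
```

Only the first sample fits in a single byte; the others need two, three, and four bytes respectively.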

When you manipulate (trim, split, splice, etc.) strings encoded in a multibyte encoding, you need to use special functions, since two or more consecutive bytes may represent a single character in such encoding schemes. Otherwise, if you apply a non-multibyte-aware string function to the string, it will probably fail to detect where a multibyte character begins or ends, leaving you with a corrupted string that has most likely lost its original meaning.
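A minimal sketch of that failure mode follows, again in Python for illustration only. It assumes the text is UTF-8 encoded and mimics a byte-oriented function by slicing the raw bytes, then compares the result with a character-oriented slice of the same string.

```python
# "\u00e9xxx" is an accented "e" followed by three "x" characters.
# The accented letter takes two bytes in UTF-8, so the string is
# 4 characters long but 5 bytes long.
text = "\u00e9xxx"
raw = text.encode("utf-8")

# Byte-oriented slice: keeps only the first byte of the two-byte character.
byte_cut = raw[:1]

# Character-oriented slice: keeps the whole first character.
char_cut = text[:1]

# errors="replace" makes the damage visible as U+FFFD, the replacement character.
damaged = byte_cut.decode("utf-8", errors="replace")

print(len(text), len(raw))
print(repr(damaged))
print(repr(char_cut))
```

The byte-oriented cut lands in the middle of a character and leaves an invalid sequence behind, while the character-oriented cut returns the accented letter intact. In PHP, the analogous contrast is between a byte-oriented function such as substr() and its mbstring counterpart mb_substr().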

mbstring provides multibyte-specific string functions that help you deal with multibyte encodings in PHP. In addition, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2, as well as many single-byte encodings for convenience.

327 questions
128
votes
3 answers

How does UTF-8 "variable-width encoding" work?

The Unicode standard has enough code points in it that you need 4 bytes to store them all. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes these into much smaller spaces by using something called "variable-width…
dsimard
  • 4,245
  • 5
  • 22
  • 16
56
votes
4 answers

Ruby 1.9: how can I properly upcase & downcase multibyte strings?

So matz made the decision to keep upcase and downcase limited to /[A-Z]/i in ruby 1.9.1. ActiveSupport::Multibyte has long had great i18n case jiggering in ruby 1.8.x via String#mb_chars. However, when tried under ruby 1.9.1, it doesn't seem to…
kch
  • 77,385
  • 46
  • 136
  • 148
47
votes
9 answers

What is a multibyte character set?

Does the term multibyte refer to a charset whose characters can - but don't have to be - wider than 1 byte, (e.g. UTF-8) or does it refer to character sets which are in any case wider than 1 byte (e.g. UTF-16) ? In other words: What is meant if…
prinzdezibel
  • 11,029
  • 17
  • 55
  • 62
46
votes
8 answers

Multibyte trim in PHP?

Apparently there's no mb_trim in the mb_* family, so I'm trying to implement one on my own. I recently found this regex in a comment on php.net: /(^\s+)|(\s+$)/u So, I'd implement it in the following way: function multibyte_trim($str) { if…
federico-t
  • 12,014
  • 19
  • 67
  • 111
37
votes
3 answers

Difference between mb_substr and substr

Will it make any difference or impact on my result, if I use substr() instead of mb_substr() function? As my server does not have support for mb_ functions, I have to replace it with substr()
Poonam Bhatt
  • 10,154
  • 16
  • 53
  • 72
35
votes
8 answers

strtolower() for unicode/multibyte strings

I have some text in a non-English/foreign language in my page, but when I try to make it lowercase, its characters are converted into black diamonds containing question marks. $a = "Երկիր Ավելացնել"; echo $b = strtolower($a); //returns �����…
Simon
  • 22,637
  • 36
  • 92
  • 121
33
votes
5 answers

UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I…
dicroce
  • 45,396
  • 28
  • 101
  • 140
31
votes
4 answers

glob() can't find file names with multibyte characters on Windows?

I'm writing a file manager and need to scan directories and deal with renaming files that may have multibyte characters. I'm working on it locally on Windows/Apache PHP 5.3.8, with the following file names in a…
Wesley Murch
  • 101,186
  • 37
  • 194
  • 228
31
votes
5 answers

Are the PHP preg_functions multibyte safe?

There are no multibyte 'preg' functions available in PHP, so does that mean the default preg_functions are all mb safe? Couldn't find any mention in the php documentation.
Spoonface
  • 1,513
  • 1
  • 20
  • 29
30
votes
3 answers

str_replace() on multibyte strings dangerous?

Given certain multibyte character sets, am I correct in assuming that the following doesn't do what it was intended to do? $string = str_replace('"', '\\"', $string); In particular, if the input was in a character set that might have a valid…
user456885
  • 443
  • 1
  • 5
  • 8
27
votes
2 answers

How can I tell if a string contains multibyte characters in Javascript?

Is it possible in Javascript to detect if a string contains multibyte characters? If so, is it possible to tell which ones? The problem I'm running into is this (apologies if the Unicode char doesn't show up right for you) s = ""; alert(s.length); …
nickf
  • 537,072
  • 198
  • 649
  • 721
27
votes
1 answer

Printing UTF-8 strings with printf - wide vs. multibyte string literals

In statements like these, where both are entered into the source code with the same encoding (UTF-8) and the locale is set up properly, is there any practical difference between them? printf("ο Δικαιοπολις εν αγρω εστιν\n"); printf("%ls", L"ο…
teppic
  • 8,039
  • 2
  • 24
  • 37
24
votes
6 answers

Get size of a std::string's string in bytes

I would like to get the bytes a std::string's string occupies in memory, not the number of characters. The string contains a multibyte string. Would std::string::size() do this for me? EDIT: Also, does size() also include the terminating NULL?
小太郎
  • 5,510
  • 6
  • 37
  • 48
22
votes
2 answers

PHP mb_substr() not working correctly?

This code print mb_substr('éxxx', 0, 1); prints an empty space :( It is supposed to print the first character, é. This seems to work however: print mb_substr('éxxx', 0, 2); But it's not right, because (0, 2) means 2 characters...
Alex
  • 66,732
  • 177
  • 439
  • 641
19
votes
10 answers

Multi-byte safe wordwrap() function for UTF-8

PHP's wordwrap() function doesn't work correctly for multi-byte strings like UTF-8. There are a few examples of mb safe functions in the comments, but with some different test data they all seem to have some problems. The function should take the…
philfreo
  • 41,941
  • 26
  • 128
  • 141