
Is there a way to determine for sure the minimum number of bytes required by a character in a specific encoding, for example one of the encodings supported by the mbstring extension? The value would be 1 for UTF-8, 2 for UTF-16, etc.

I don't want to obtain the length of a particular string or char.

I want to know the minimum char size supported by a given encoding, according to its specification.

I currently use this code:

<?php

function flawed_detection($encoding)
{
    // 'a' is used in the hope that it needs the fewest bytes in every supported encoding
    return strlen(mb_convert_encoding('a', $encoding, 'UTF-8'));
}

foreach (mb_list_encodings() as $encoding) {
    echo "$encoding: ", flawed_detection($encoding), "\n";
}

Partial output:

...
UTF-16LE: 2
UTF-8: 1
UTF-7: 1
UTF7-IMAP: 1
ASCII: 1
EUC-JP: 1
...

But I'm not sure which character is the "correct" one to use, if there is one at all.

Edit: I've tested the brute-force approach with every character from U+0000 to U+10FFFF in every encoding, and the results are exactly the same as with my finally_not_so_flawed_detection function (with either the 'a' character or a space) :p
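For readers who want to repeat that check, here is a minimal sketch of what such a brute-force pass might look like. It is not the asker's actual code: the function name `brute_force_min_width` and the `$maxCodePoint` parameter are hypothetical, and it assumes PHP 7.2+ (for `mb_chr`) with mbstring's default behaviour of substituting characters the target encoding cannot represent.

```php
<?php

// Hypothetical helper: smallest byte length seen when converting every
// code point from U+0000 up to $maxCodePoint into $encoding.
function brute_force_min_width(string $encoding, int $maxCodePoint = 0x10FFFF): int
{
    $min = PHP_INT_MAX;

    for ($cp = 0; $cp <= $maxCodePoint; $cp++) {
        // Surrogates are not characters; skip them.
        if ($cp >= 0xD800 && $cp <= 0xDFFF) {
            continue;
        }

        $char = mb_chr($cp, 'UTF-8');
        if ($char === false) {
            continue;
        }

        // Unmappable characters are replaced by mbstring's substitute
        // character, which is still a valid character in $encoding.
        $len = strlen(mb_convert_encoding($char, $encoding, 'UTF-8'));
        if ($len > 0 && $len < $min) {
            $min = $len;
        }
    }

    return $min;
}

// Limited range here to keep the run short; raise it for a full scan.
echo brute_force_min_width('UTF-16LE', 0xFF), "\n";
```

Scanning the full range for every encoding is slow, so the second argument is there to restrict the scan while experimenting.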

Ayell
  • Possible duplicate of [Measure string size in Bytes in php](http://stackoverflow.com/questions/7568949/measure-string-size-in-bytes-in-php) – Kevin Kopf Aug 02 '16 at 22:51
  • Why? What's the goal here? Do you have a valid business or technical reason to not use UTF-8 across the board? – Peter Bailey Aug 03 '16 at 13:22
  • What why? This is a general purpose question :p And I use UTF-8 in my project, but I need to decode some strings in binary files. – Ayell Aug 03 '16 at 20:13
  • Related brute-force approach in Python: http://stackoverflow.com/questions/30870107/mapping-of-character-encodings-to-maximum-bytes-per-character – toucanb Aug 03 '16 at 23:14

1 Answer


I'm not aware of any way to tell for sure, but a reasonable approximation is to check the width of the space character (" ", U+0020). As far as I know, every sane text encoding supports that character, and every variable-length encoding uses a minimum-length sequence for it.
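As a rough illustration of that approximation, the sketch below measures the space character alongside a couple of other ASCII probes and keeps the smallest width. The helper names `char_width` and `min_char_width` and the probe list are assumptions made for this example, not part of mbstring, and the arrow function requires PHP 7.4+.

```php
<?php

// Hypothetical helper: byte length of one UTF-8 character once converted to $encoding.
function char_width(string $char, string $encoding): int
{
    return strlen(mb_convert_encoding($char, $encoding, 'UTF-8'));
}

// Hypothetical helper: smallest width among a few probe characters.
// The space comes first, per the reasoning above; 'a' and '0' are extra probes.
function min_char_width(string $encoding, array $probes = [' ', 'a', '0']): int
{
    return min(array_map(fn ($c) => char_width($c, $encoding), $probes));
}

foreach (['UTF-8', 'UTF-16LE', 'UTF-32BE', 'ASCII', 'EUC-JP'] as $encoding) {
    echo $encoding, ': ', min_char_width($encoding), "\n";
}
```

This is still a heuristic rather than a guarantee: it only reports the smallest width among the probes, so it is worth spot-checking any encoding that matters to you.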