Is a translation of the below code at all possible using PHP?

The code below is written in JavaScript. It returns HTML with numeric character references where needed, e.g. smslån -> smsl&#229;n

I have been unsuccessful at creating a translation. This script looked like it might work, but it returns &aring; for å instead of &#229; as the JavaScript below does.

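function toEntity() {
  // Read the input text from the form field named "utf".
  var aa = document.form.utf.value;
  var bb = '';
  for (var i = 0; i < aa.length; i++)
  {
    // charCodeAt returns a UTF-16 code unit; values above 127 are non-ASCII.
    if (aa.charCodeAt(i) > 127)
    {
      // Replace the character with a decimal numeric character reference.
      bb += '&#' + aa.charCodeAt(i) + ';';
    }
    else
    {
      // ASCII characters are copied through unchanged.
      bb += aa.charAt(i);
    }
  }
  document.form.entity.value = bb;
}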

PHP's ord function sounds like it does the same thing as charCodeAt, but it does not. I get 195 for å using ord and 229 using charCodeAt. That, or I am having some incredibly difficult encoding problems.
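The 195 versus 229 discrepancy is what one would expect if the PHP string is UTF-8 encoded (an assumption, since the question does not state the encoding): `ord` reads a single byte, while `charCodeAt` reads a UTF-16 code unit. A minimal JavaScript sketch of the same difference, using the standard `TextEncoder` to get at the UTF-8 bytes:

```javascript
// Compare what charCodeAt sees with the raw UTF-8 bytes that a
// byte-oriented function would see for the same character.
const ch = '\u00e5'; // å

// charCodeAt returns the UTF-16 code unit, which for this character
// equals its Unicode code point.
const codeUnit = ch.charCodeAt(0);

// TextEncoder always encodes to UTF-8 and returns a Uint8Array of bytes.
const utf8Bytes = Array.from(new TextEncoder().encode(ch));

console.log(codeUnit);  // 229
console.log(utf8Bytes); // [ 195, 165 ]
```

The first UTF-8 byte, 195, matches the value reported for `ord`, which suggests the string itself is fine and only the unit of measurement differs.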

darkAsPitch
  • You mean [this?](http://www.php.net/manual/en/function.mb-encode-numericentity.php#88586), or phihag's answer below, basically? I don't see a utf8 version of ord anywhere. – darkAsPitch Sep 30 '11 at 11:52
  • I'm not sure. I tried playing around with Miguel's code for 20 minutes but it seems what phihag below suggested is exactly what I needed. In terms of this application anyways. Is there any reason to believe it's not? – darkAsPitch Oct 01 '11 at 14:21

1 Answer

Use mb_encode_numericentity:

$convmap = array(0x80, 0xffff, 0, 0xffff);
echo mb_encode_numericentity($utf8Str, $convmap, 'UTF-8');
phihag
  • Yeah while I wanted to answer, I saw you did it already so I noticed. ;) It's really a cool function for the job. – hakre Sep 30 '11 at 10:12
  • The only thing that worries me is that $convmap - what is that exactly? There isn't a great explanation on the manual page. Do I have to enter all of the possible conversions or something? My feeble mind reads it as "conversion map". – darkAsPitch Sep 30 '11 at 11:52
  • @darkAsPitch It is messy. `$convmap` specifies which characters to encode. It should really be a callback function, but that would probably be slow, and the usage of callbacks in PHP predates the function anyways. The first two numbers specify the inclusive range of character codes to convert, and the third and fourth are an offset and a bitmask (0 and 0xffff for all practical purposes). For example, if you want to convert all characters to HTML entities, specify `array(0, 0xffff, 0, 0xffff)`. Basically, `(0x80, 0xffff, ..)` is the equivalent of `charCode > 127` in your question. – phihag Sep 30 '11 at 12:00
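
To make the four `$convmap` numbers concrete, here is a rough JavaScript sketch of the rule as the comment above describes it: a code point inside the inclusive range is replaced by a numeric reference built from `(codePoint + offset) & mask`. The helper name `encodeNumericEntities` is hypothetical, and this is an illustration of the range check only, not PHP's actual implementation.

```javascript
// Hypothetical helper illustrating the convmap rule described above.
// convmap is [start, end, offset, mask], mirroring the PHP array.
function encodeNumericEntities(str, convmap) {
  const [start, end, offset, mask] = convmap;
  let out = '';
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp >= start && cp <= end) {
      // Inside the range: emit a decimal numeric character reference.
      out += '&#' + ((cp + offset) & mask) + ';';
    } else {
      // Outside the range: copy the character through unchanged.
      out += ch;
    }
  }
  return out;
}

// Same map as in the answer: encode everything from 0x80 upwards.
const encodedWord = encodeNumericEntities('smsl\u00e5n', [0x80, 0xffff, 0, 0xffff]);
// Map starting at 0: every character in range is encoded, including ASCII.
const encodedAscii = encodeNumericEntities('a', [0, 0xffff, 0, 0xffff]);

console.log(encodedWord);  // smsl&#229;n
console.log(encodedAscii); // &#97;
```

With an upper bound of 0xffff, code points beyond the Basic Multilingual Plane fall outside the range and pass through unchanged in this sketch.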