2

The function json_encode requires a valid UTF-8 string. I have a string that may be in a different encoding. I need to ignore or substitute all invalid characters to be able to convert to JSON.

  1. It should be something very simple and robust.
  2. The error is in a module for manual checking, so mojibake is fine.
  3. The code responsible for fixing encoding is in a different module. (It was broken, though.) I don’t want to duplicate responsibility.

The hexadecimal representation of an example of an invalid string: 496e76616c6964206d61726b2096

My current solution:

$raw_str = hex2bin('496e76616c6964206d61726b2096');
$sane_str = @\iconv('UTF-8', 'UTF-8//IGNORE', $raw_str);

The three problems with my code:

  1. iconv looks a little too heavy.
  2. Many programmers don't like @.
  3. iconv may discard too much, possibly the whole string.

Any better idea?

There is a similar question, "Ensuring valid UTF-8 in PHP", but I don't care about conversion.

Peter Mortensen
Michas
  • If you don't care about conversion... what are you trying to do then? –  Apr 27 '16 at 12:26
  • I need a valid UTF-8 string for json_encode. Valid mojibake is fine. That's all. – Michas Apr 27 '16 at 12:41
  • Honestly speaking, your solution is the cleanest I can see. If you don't want to use `@`, you might want to run the string through an encoding check first, which is troublesome. –  Apr 27 '16 at 12:45

4 Answers

4

You should look into mb_convert_encoding. It is able to convert text from pretty much any encoding to another. I had to use it for a similar project: http://php.net/manual/en/function.mb-convert-encoding.php
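As a minimal sketch of that approach, assuming the stray byte in the question's example comes from Windows-1252 (an assumption; the question does not name the source encoding), the conversion could look like this:

```php
<?php
// Example bytes from the question; the trailing 0x96 is not valid UTF-8.
$raw_str = hex2bin('496e76616c6964206d61726b2096');

// Assumption: the input is Windows-1252, where 0x96 is an en dash (U+2013).
$utf8_str = mb_convert_encoding($raw_str, 'UTF-8', 'Windows-1252');

// The result is valid UTF-8, so json_encode no longer fails on it.
$json = json_encode($utf8_str);
```

If the source encoding is guessed wrong, the output is still valid UTF-8 but may be mojibake, which the question says is acceptable.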

raphael75
2

I think this is the best solution.

$raw_str = hex2bin('496e76616c6964206d61726b2096');
$sane_str = mb_convert_encoding($raw_str, 'UTF-8', 'UTF-8');
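A quick way to check the effect, as a sketch only: the helper name to_valid_utf8 is hypothetical, and the character that replaces invalid bytes depends on the mbstring substitute-character setting.

```php
<?php
// Hypothetical helper wrapping the one-liner above.
function to_valid_utf8($s)
{
    // Converting UTF-8 to UTF-8 leaves valid sequences untouched and
    // replaces invalid bytes with mbstring's substitute character.
    return mb_convert_encoding($s, 'UTF-8', 'UTF-8');
}

$raw_str  = hex2bin('496e76616c6964206d61726b2096');
$sane_str = to_valid_utf8($raw_str);

var_dump(json_encode($raw_str) !== false);   // bool(false): invalid UTF-8 is rejected
var_dump(json_encode($sane_str) !== false);  // bool(true)
```

This avoids both @ and iconv, and it never drops the whole string: only the invalid bytes are substituted.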
Michas
1

The json_encode function expects a UTF-8 encoded string. Check the encoding with a function based on the W3C-recommended regex from the answer to "Ensuring valid UTF-8 in PHP":

function encodeUtf8($string){
 if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
 )*$%xs', $string)) {
    return $string; // already valid UTF-8
 }
 // Not valid UTF-8: assume the input is Windows-1252 and convert it.
 return iconv('CP1252', 'UTF-8', $string);
}

Then you could use it:

$sane_str = encodeUtf8($raw_str);
Peter Mortensen
F.Igor
1

You could use mb_detect_encoding to detect whether the string is not UTF-8, and then use mb_convert_encoding to convert the text to UTF-8:

<?php
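/**
 * Convert a string to UTF-8.
 * @param string $string String to be converted
 * @return bool|string The UTF-8 string, or false if the encoding cannot be detected
 */
function convert_json($string)
{
    if (ctype_print($string)) { // printable ASCII is already valid UTF-8
        return $string;
    }
    $from = mb_detect_encoding($string, ['auto']);
    if ($from === false) { // the encoding could not be detected
        return false;
    }
    $to = 'UTF-8';
    if ($from !== $to) {
        $string = mb_convert_encoding($string, $to, $from);
    }
    return $string;
}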