3

I have a UTF-8 encoded XML file, which was exported from a WordPress MySQL database.

Although the file is saved as UTF-8 and declares UTF-8 as its encoding, I get gibberish instead of the Hebrew text that is supposed to be there. It looks like this:

™×•טות

How can I find the original encoding or charset and convert the text into proper Hebrew?

PHP's mb_detect_encoding($str) returns UTF-8.

I tried all sorts of PHP encoding functions with different settings and input/output charsets, but they all just print different-looking blocks of gibberish, like:

ÃâÃËÃâ¢Ãâ¢ÃËÃ

and

�� ×שמ×

Any ideas how to go about this?

Adam Tal

4 Answers

3
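// Maps single-byte Hebrew text (Windows-1255 / ISO-8859-8) that was misread as Latin-1 back to Hebrew letters.
function convert($str) {
    $hebrew = array("א", "ב", "ג", "ד", "ה", "ו", "ז", "ח", "ט", "י", "כ", "ל", "מ", "נ", "ס", "ע", "פ", "צ", "ק", "ר", "ש", "ת", "ך", "ם", "ן", "ף", "ץ");
    $gibberish = array("à", "á", "â", "ã", "ä", "å", "æ", "ç", "è", "é", "ë", "ì", "î", "ð", "ñ", "ò", "ô", "ö", "÷", "ø", "ù", "ú", "ê", "í", "ï", "ó", "õ");
    // The arrays are index-aligned: each $gibberish[i] is replaced by $hebrew[i].
    return str_replace($gibberish, $hebrew, $str);
}

// Convert the Latin-1 bytes to UTF-8 first (utf8_encode() is deprecated as of PHP 8.2), then map.
$hebrew_string = convert(mb_convert_encoding($gibberish_string, 'UTF-8', 'ISO-8859-1'));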
1

If you have access to the database, you can fix it easily by exporting it as latin1 and importing it as UTF-8, as has been suggested here.

Tomer Cohen
0

This is very similar to this question.

From what I could see, this is a mangled Unicode string, where each byte of the original encoded text was decoded as a separate character, so every original Unicode character ended up as two Unicode characters.

The code I came up with simply discarded the empty high-order byte and reconstructed the original byte array from that. The code is only an example and is very simplistic in approach, but should help you get there.
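As a minimal sketch of that idea in PHP (not the code from the linked answer): assuming the text is UTF-8 that was decoded as ISO-8859-1 and then re-encoded as UTF-8, converting back to ISO-8859-1 maps each character to a single byte, which is the same as dropping the empty high-order byte. The function name `undo_double_encoding` is hypothetical. If the text was misread as Windows-1252 instead, that encoding name may need to be swapped in, and bytes that Windows-1252 leaves undefined may not be recoverable.

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 bytes misread as ISO-8859-1".
function undo_double_encoding(string $mangled): string
{
    // Each mangled character stands for one original byte, so mapping
    // code points U+0000..U+00FF back to bytes rebuilds the byte array.
    $bytes = mb_convert_encoding($mangled, 'ISO-8859-1', 'UTF-8');

    // If any character fell outside ISO-8859-1 it was replaced by "?",
    // so the round trip no longer matches and the input was not mangled this way.
    $lossless = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1') === $mangled;

    // Accept the result only if the recovered bytes are themselves valid UTF-8.
    return ($lossless && mb_check_encoding($bytes, 'UTF-8')) ? $bytes : $mangled;
}

// Demo: mangle a Hebrew string on purpose, then repair it.
$original = "שלום";
$mangled  = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');
var_dump(undo_double_encoding($mangled) === $original);
```

The two guards mean that text which is already correct (plain ASCII, proper Hebrew, or ordinary accented Latin text) is returned unchanged rather than damaged.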

Oded
0

Take a look at your PHP file; maybe it isn't saved as UTF-8, and that's the reason why your XML query returns this unwanted string.
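One rough way to check this from PHP itself is to test whether the file's bytes are valid UTF-8. This is only a sketch: the helper name `file_is_valid_utf8` is hypothetical, and a file that passes may still contain text that was mangled before it was saved.

```php
<?php
// Hypothetical helper: report whether a file's raw bytes form valid UTF-8.
function file_is_valid_utf8(string $path): bool
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return mb_check_encoding($bytes, 'UTF-8');
}

// Check the currently running script.
var_dump(file_is_valid_utf8(__FILE__));
```

A pure-ASCII file also counts as valid UTF-8, so this only flags files saved in a different single-byte encoding such as Windows-1255.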

therufa