20

I am trying to convert text that users paste from Word, which contains MS Word ellipses and long dashes, before processing it further.

I found an old proposed solution to the problem here: http://www.codingforums.com/archive/index.php/t-47163.html, but it does not work for me. After replacing the ellipsis, for example, the variable comes back empty. I have never seen anything like this before:

$src = "Long word dash – and weird Word ellipsis…";
$src = str_replace("‘", "'", $src);
$src = str_replace("’", "'", $src);
$src = str_replace("”", '"', $src);
$src = str_replace("“", '"', $src);
$src = str_replace("–", "-", $src);
$src = str_replace("…", "...", $src);
print $src;

Any ideas?

giorgio79
  • See my answer on **[this question](http://stackoverflow.com/questions/6698785/modify-simplify-topic-title-for-displaying-in-url)**. It's not going to cover every scenario, but should handle most common ones. – simshaun Sep 14 '11 at 16:01
  • I realized that my PHP file was saved as ANSI, and MySQL was using a non-UTF-8 general encoding as well. After correcting these, my function and the one below both work. Much appreciated, everyone. – giorgio79 Sep 14 '11 at 18:22
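
The fix described in the last comment comes down to getting the script file, the database, and the pasted text into the same encoding. A minimal sketch of normalizing input before any replacement, assuming the mbstring extension is available and assuming that anything which is not valid UTF-8 arrived as Windows-1252 (that will not hold for every source). The helper name `toUtf8` is hypothetical:

```php
<?php
// Hypothetical helper: returns $text as UTF-8.
// Assumption: input that is not valid UTF-8 is Windows-1252
// (what Windows tools usually mean by "ANSI" in Western locales).
function toUtf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// "\x85" is the ellipsis in Windows-1252;
// "\xE2\x80\xA6" is the same character in UTF-8.
$fromAnsi = toUtf8("ellipsis\x85");
$fromUtf8 = toUtf8("ellipsis\xE2\x80\xA6");
var_dump($fromAnsi === $fromUtf8);
```

Once the text is known to be UTF-8, replacements written as UTF-8 byte sequences behave the same no matter how the script file itself was saved.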

4 Answers

50

For anyone getting the diamond question mark in PHP, this method of replacing UTF-8 characters worked better than using the chr function.

$search = [                 // www.fileformat.info/info/unicode/char/<NUM>/ where <NUM> is the code point, e.g. 2018
                "\xC2\xAB",     // « (U+00AB) in UTF-8
                "\xC2\xBB",     // » (U+00BB) in UTF-8
                "\xE2\x80\x98", // ‘ (U+2018) in UTF-8
                "\xE2\x80\x99", // ’ (U+2019) in UTF-8
                "\xE2\x80\x9A", // ‚ (U+201A) in UTF-8
                "\xE2\x80\x9B", // ‛ (U+201B) in UTF-8
                "\xE2\x80\x9C", // “ (U+201C) in UTF-8
                "\xE2\x80\x9D", // ” (U+201D) in UTF-8
                "\xE2\x80\x9E", // „ (U+201E) in UTF-8
                "\xE2\x80\x9F", // ‟ (U+201F) in UTF-8
                "\xE2\x80\xB9", // ‹ (U+2039) in UTF-8
                "\xE2\x80\xBA", // › (U+203A) in UTF-8
                "\xE2\x80\x93", // – (U+2013) in UTF-8
                "\xE2\x80\x94", // — (U+2014) in UTF-8
                "\xE2\x80\xA6"  // … (U+2026) in UTF-8
    ];

    $replacements = [
                "<<", 
                ">>",
                "'",
                "'",
                "'",
                "'",
                '"',
                '"',
                '"',
                '"',
                "<",
                ">",
                "-",
                "-",
                "..."
    ];

    // str_replace() returns the modified string, so assign the result.
    $string = str_replace($search, $replacements, $string);
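
As a usage sketch, the same idea can be wrapped in a function and applied to the sample string from the question. The function name `normalizeWordPunctuation` is hypothetical and only a subset of the list above is shown. The `\u{...}` escapes assume PHP 7.0 or later; they produce the same UTF-8 bytes as the `\x..` sequences above, regardless of how the script file is saved:

```php
<?php
// Hypothetical wrapper around str_replace() for UTF-8 input.
function normalizeWordPunctuation(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => "-",   // en dash
        "\u{2014}" => "-",   // em dash
        "\u{2026}" => "...", // horizontal ellipsis
    ];
    return str_replace(array_keys($map), array_values($map), $text);
}

$src = "Long word dash \u{2013} and weird Word ellipsis\u{2026}";
echo normalizeWordPunctuation($src), "\n";
// prints: Long word dash - and weird Word ellipsis...
```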
David
Verron Knowles
  • Dude, thank you. I don't know what's going on with any of the HTML parsing libraries but they all seem to spit back nasty character replacements... I think they assume that the charset is ISO-8859-1 by default – Funktr0n May 14 '14 at 22:22
  • Thanks Verron! Just noticed the fileformat URL should be www.fileformat.info/info/unicode/char/<NUM>/ – user697576 Jul 22 '14 at 21:39
  • Happy I could help! None of the other solutions were working 100%, so I figured I'd share. – Verron Knowles Jul 23 '14 at 03:58
  • Verron Knowles' answer is excellent, but is missing the final member of the second array. Also, if you're using PHP < 5.4, you'll have to convert the [] array form to array(). – Bob Ray Nov 09 '14 at 20:09
  • If you combine Verron Knowles' list of Unicodes with the one from here: http://stackoverflow.com/questions/20025030/convert-all-types-of-smart-quotes-with-php you get a pretty complete list. THANKS!! – cbloss793 Dec 09 '15 at 17:32
  • I was using this approach, but then kept running into more and more characters that were not in this list. So I searched further, found the "iconv" function in the PHP docs, and used the following approach: $str = iconv("UTF-8", "ASCII//TRANSLIT", $str); – Gagik Sukiasyan Feb 08 '17 at 13:51
  • The above worked for me except for characters that are not visible and then turned into a stray character (I suspect double or triple spaces). The solution provided by christopher_b below did not work for me with the apostrophes but did work for those invisible characters. Adding chr(194) to the $search array from the answer above and "" to the $replacements array sorted the problem out. – user1620090 Oct 12 '17 at 11:13
  • `str_replace($search, $replacements, $string);` did not work for me. I used `preg_replace($search, $replacements, $string);` – sridharnetha May 08 '18 at 08:15
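
The `iconv` transliteration mentioned in the comments can be sketched as below. The result of `//TRANSLIT` depends on the platform's iconv implementation and locale, so the exact replacement characters are not guaranteed. The helper name `toAsciiOrOriginal` and the fall-back-to-original behavior are assumptions made for this illustration:

```php
<?php
// Hypothetical helper: try to transliterate UTF-8 text to ASCII, and
// return the input unchanged if iconv is unavailable or reports failure.
function toAsciiOrOriginal(string $text): string
{
    if (!function_exists('iconv')) {
        return $text;
    }
    // @ suppresses the notice iconv raises for characters it cannot convert.
    $converted = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
    return $converted === false ? $text : $converted;
}

echo toAsciiOrOriginal("Plain ASCII passes through unchanged"), "\n";
// Output of the next line varies by platform and locale.
echo toAsciiOrOriginal("dash \u{2013} ellipsis\u{2026}"), "\n";
```

Compared with an explicit search and replace list, this covers more characters with less code, at the cost of less predictable output.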
10

Hmm. I use this function for sanitizing text copied into an RTE. It may or may not work in this case. It converts to HTML entities, but you could tweak it to just convert to regular characters:

function convertFromCP1252($string)
{
    $search = array('&',
                    '<',
                    '>',
                    '"',
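                    chr(212), // 0xD4: left single quote in Mac OS Roman
                    chr(213), // 0xD5: right single quote in Mac OS Roman
                    chr(210), // 0xD2: left double quote in Mac OS Roman
                    chr(211), // 0xD3: right double quote in Mac OS Roman
                    chr(208), // 0xD0: en dash in Mac OS Roman
                    chr(209), // 0xD1: em dash in Mac OS Roman
                    chr(201), // 0xC9: ellipsis in Mac OS Roman
                    chr(145), // 0x91: left single quote in Windows-1252
                    chr(146), // 0x92: right single quote in Windows-1252
                    chr(147), // 0x93: left double quote in Windows-1252
                    chr(148), // 0x94: right double quote in Windows-1252
                    chr(150), // 0x96: en dash in Windows-1252
                    chr(151), // 0x97: em dash in Windows-1252
                    chr(133), // 0x85: ellipsis in Windows-1252
                    chr(194)  // 0xC2: stray byte, removed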
                );

     $replace = array(  '&amp;',
                        '&lt;',
                        '&gt;',
                        '&quot;',
                        '&#8216;',
                        '&#8217;',
                        '&#8220;',
                        '&#8221;',
                        '&#8211;',
                        '&#8212;',
                        '&#8230;',
                        '&#8216;',
                        '&#8217;',
                        '&#8220;',
                        '&#8221;',
                        '&#8211;',
                        '&#8212;',
                        '&#8230;',
                        ''
                    );

    return str_replace($search, $replace, $string);
}
christopher_b
5

Great solution. I copied and pasted it and it worked without a problem. On further study, I added a few characters that were not in the search and replace arrays. To find the character numbers, I wrote a PHP function that shows the numeric value of each byte in a string:

// Print each byte of $s followed by its numeric value in parentheses.
function stdump($s){
  for($i=0;$i<strlen($s);$i++){
    echo substr($s,$i,1) . "(" . ord(substr($s,$i,1)) . ")";
  }
  echo "<br/>";
}

Each character is displayed with its number next to it in parentheses. Like this:

stdump("GPUs…");

produces:

G(71)P(80)U(85)s(115)â(226)€(128)¦(166)
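
The numbers shown are byte values rather than character codes: the ellipsis appears as three bytes (226, 128, 166) because it is encoded in UTF-8. A sketch of a variant that reports one Unicode code point per character instead, assuming the mbstring extension and PHP 7.4 or later for `mb_str_split`. The function name `codePointDump` is hypothetical:

```php
<?php
// Hypothetical variant of stdump(): one entry per character, not per byte.
function codePointDump(string $s): string
{
    $parts = [];
    foreach (mb_str_split($s, 1, 'UTF-8') as $char) {
        $parts[] = sprintf('%s(U+%04X)', $char, mb_ord($char, 'UTF-8'));
    }
    return implode(' ', $parts);
}

echo codePointDump("s\u{2026}"), "\n";
// prints: s(U+0073) …(U+2026)
```

The code point (here 2026) is the number to look up when adding a character to a search list.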

Hope this helps.

--Keith

leekei
0

This works for me:

$str=file_get_contents($file); 

$array=array("‘"=>"'","’"=>"'","”"=>'"',"“"=>'"',"–"=>"-","—"=>"-","…"=>"...");

$str = strtr( $str,$array);

file_put_contents($file,$str);  
4b0