
It's more than possible this already has an answer; however, since encoding is far from my strong point, I don't really know what to search for to find out.

Essentially, I have a string that contains (what I would call) 'bad' characters. For context, this string is coming back from a cURL response. Example:

$bad_str = "Sunday’s";

Question: How to swap these out for more readable substitutes?

This would be a lot easier if I knew what this sort of problem was called, or what sort of encoding it corresponded to. I have read:

I have tried creating a swaps map and running preg_replace_callback on it, i.e.:

$encoding_swapouts_map = array(
    '’' => "'",
    'Ã©' => 'é',
    '–' => '-',
    'Â£' => '£'
);

$bad_str = preg_replace_callback(
    $ptn = '/'.implode('|', array_keys($encoding_swapouts_map)).'/i',
    function($match) use ($encoding_swapouts_map) {
        return $encoding_swapouts_map[$match[0]];
    },
    $bad_str
);

This doesn't seem to match the bad characters, so the callback is never called. Interestingly, $ptn, when printed out, shows some mutation:

 /’|é|–|£/i

Thanks in advance.

Mitya
  • Does the curl response have Content-Type header? It should tell you the encoding you get, then just convert to your encoding. – Marek Mar 24 '15 at 14:45
  • Thanks. I just looked and it's coming back as `text/xml;charset=UTF-8` – Mitya Mar 24 '15 at 14:48
  • I think what you are seeing are special ms word characters. Look at this answer: http://stackoverflow.com/questions/7419302/converting-microsoft-word-special-characters-with-php – Marek Mar 24 '15 at 14:51
  • Where does this come from in the first place? What you're trying to do is fix the symptom when the patient is already dead. You should never get into this situation in the first place if you'd be handling encodings correctly. Don't do this, fix the problem at the source. This is called "mojibake" BTW. Start here: [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/) – deceze Mar 24 '15 at 14:59
  • Actually I did - see this question: http://stackoverflow.com/questions/28726515/encoding-certain-characters-coming-back-wrecked-through-curl-request ...but I didn't have a lot of joy from it. The answerer pointed out that controlling incoming encodings from cURL is famously difficult. I should point out I do not control the content source. – Mitya Mar 24 '15 at 15:02
  • If you don't know the encoding, you will need to handle encodings as browsers do: parse the HTTP headers to read the declared encoding there, fall back to HTML meta tags if the HTTP header does not exist; convert from the encoding you figured out this way to your desired encoding. This can be quite annoying, use a library which abstracts all that if possible. – deceze Mar 24 '15 at 15:31
  • As for critique of your question: sure, it's well written, verbal +1 for that. However, all it basically says is "I have an encoding problem", with too little detail on where it stems from. It offers little for us to go on in terms of offering a solution, and therefore doesn't add a whole lot to the Stackoverflow pool of knowledge overall. *You* weren't necessarily critiqued personally, *your question* was voted on based on how useful it is within SO. – deceze Mar 24 '15 at 15:35
  • This is not an encoding problem, but a problem with extended characters that your viewer (I suppose a browser) is not able to display. – Marek Mar 24 '15 at 17:02
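
The header-based approach suggested in the comments (read the declared charset, then convert to your own encoding) might look roughly like the sketch below. The function names `charset_from_content_type` and `body_to_utf8` are hypothetical, and falling back to UTF-8 when no charset is declared is an assumption; a full solution would also inspect meta tags or the XML declaration, which this sketch does not do.

```php
<?php
// Hypothetical helpers; only preg_match() and mb_convert_encoding() are PHP built-ins.

// Return the charset parameter of a Content-Type header value, or null if none is present.
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([^";\s]+)/i', $contentType, $m)) {
        return $m[1];
    }
    return null;
}

// Convert a response body to UTF-8 from the declared charset.
// Assumption: treat the body as UTF-8 when nothing was declared.
function body_to_utf8(string $body, ?string $declared): string
{
    return mb_convert_encoding($body, 'UTF-8', $declared ?? 'UTF-8');
}

// Header value taken from the comment thread above.
$charset = charset_from_content_type('text/xml;charset=UTF-8');
echo $charset, "\n";

// "caf\xE9" is "café" encoded as ISO-8859-1.
echo body_to_utf8("caf\xE9", 'ISO-8859-1'), "\n";
```

With cURL, the header value can be read after the request with `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)`. If the server already declares UTF-8, as it does here, and the text is still garbled, the damage probably happened either before the server sent it or when the string was displayed or re-decoded afterwards.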

1 Answer


What happened to the answer that I liked? (it was deleted).
I think it had a typo, however.

  $text = "Sunday’s";
  $bad = array("’","Ã©","–","Â£");
  $good = array("'","é","-","£");
  $newtext = str_replace($bad, $good, $text);
dcromley
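
As an alternative to a hand-maintained replacement list, here is a minimal sketch that tries to reverse the damage in one step. It assumes the text was originally UTF-8 and was wrongly decoded as Windows-1252 somewhere along the way, which is what sequences like `’` suggest but is not confirmed by the question. The name `fix_mojibake` is hypothetical. The strings use `\u{...}` escapes (PHP 7+) so the example does not depend on how the source file itself is saved.

```php
<?php
// Hypothetical helper: undo "UTF-8 bytes decoded as Windows-1252" mojibake.
// Returns the input unchanged when the repair does not look safe.
function fix_mojibake(string $text): string
{
    // Map every character back to its single Windows-1252 byte.
    // If the assumption holds, those bytes are the original UTF-8 sequence.
    $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

    // Reject results that are not valid UTF-8 (the text was probably fine already).
    if (!mb_check_encoding($candidate, 'UTF-8')) {
        return $text;
    }

    // Reject lossy conversions: decoding the candidate as Windows-1252
    // must reproduce the input exactly.
    if (mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') !== $text) {
        return $text;
    }

    return $candidate;
}

// "Sunday" + U+00E2 U+20AC U+2122 + "s" is the garbled string from the question.
$bad = "Sunday\u{00E2}\u{20AC}\u{2122}s";
echo fix_mojibake($bad), "\n";
```

A limitation worth knowing: a few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) have no Windows-1252 character, so text whose original UTF-8 contained them, for example a right double quotation mark, may not round-trip and is left unchanged by this sketch.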