I'm making a cURL request to a third-party website which returns a text file. I need to do a few string replacements on it to replace certain characters with their HTML entity equivalents, e.g. I need to replace í with &iacute;.

Using str_replace/preg_replace_callback on the response directly didn't produce any matches (whether searching for í directly or using its hex code \x00\xED), so I ran utf8_encode() on the response before carrying out the replacement. But utf8_encode() turns every í into Ã.

Why is this happening, and what's the correct approach to carrying out UTF-8 replacements on an arbitrary piece of text using PHP?

*edit - some further research reveals

utf8_decode("í") == í;
utf8_encode("í") == í;
utf8_encode("\xc3\xad") ==  í;
wheresrhys
  • Are you sure that the text is not already utf-8? Getting an `Ã` after an encode run would suggest that you're now double-encoding the text. – Marc B May 06 '12 at 19:13
  • @Marc B I'm not sure, as it's a third-party site that I'm getting the cURL response from, but the HTML pages on that site explicitly specify UTF-8, so I expect the text file would be as well. I was trying utf8_encode in response to the original str_replace not working, and am no closer to figuring out why that is. – wheresrhys May 06 '12 at 19:21
  • and of course, are you sure you're outputting into a utf-8 environment? dumping utf-8 text into an iso8859 page will give the same effect. – Marc B May 06 '12 at 19:54

2 Answers

utf8_encode is definitely not the way to go here (you're double-encoding if you do that).
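To see where the Ã comes from, it can help to look at the bytes. The sketch below assumes the response already contains í as the UTF-8 byte pair 0xC3 0xAD. `latin1_to_utf8()` is a hypothetical helper, written out by hand to mimic what utf8_encode() does (treat every input byte as one ISO-8859-1 character).

```php
<?php
// Hypothetical helper mimicking utf8_encode(): each input byte is treated
// as one ISO-8859-1 character and re-encoded as UTF-8.
function latin1_to_utf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $char) {
        $code = ord($char);
        if ($code < 0x80) {
            $out .= $char; // ASCII bytes are unchanged
        } else {
            // Code points 0x80-0xFF become a two-byte UTF-8 sequence.
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        }
    }
    return $out;
}

$alreadyUtf8 = "\xC3\xAD"; // í, already encoded as UTF-8

// Encoding a second time turns the two bytes into two characters:
// 0xC3 becomes Ã (c3 83) and 0xAD becomes an invisible soft hyphen (c2 ad).
echo bin2hex(latin1_to_utf8($alreadyUtf8)), "\n"; // c383c2ad
```

The soft hyphen usually doesn't render, which would explain why only the Ã is visible.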

Re. searching for the character directly or using its hex code, did you make sure to add the u modifier at the end of the regex? e.g. /\x00\xED/u?
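As a rough sketch of the u modifier in use, assuming the subject text is valid UTF-8: in UTF-8 mode a character is usually written as a code point (`\x{ED}` for í) rather than as separate bytes. The sample string and variable names are made up for illustration.

```php
<?php
// Sample input, assumed to be valid UTF-8 ("\xC3\xAD" is í).
$text = "Mar\xC3\xADa";

// With /u, \x{ED} matches the code point U+00ED rather than a raw byte.
$result = preg_replace('/\x{ED}/u', '&iacute;', $text);

if ($result === null) {
    // preg_replace() returns null on error, e.g. when the subject is not valid UTF-8.
    echo "Replacement failed; the input may not be UTF-8\n";
} else {
    echo $result, "\n"; // Mar&iacute;a
}
```

If this returns null for the real response, that is a hint that the text is not UTF-8 after all.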

Ansari
  • I wasn't sure what the syntax for utf8 in regex was, so thanks for that... but it doesn't even work with str_replace (using advice from here http://stackoverflow.com/questions/3959626/replace-unicode-character) – wheresrhys May 06 '12 at 19:57
  • Well, can you try preg_replace instead? As for str_replace - perhaps the file you save this code in needs to be saved in a certain encoding or with a certain marker (like in the answer you linked to). – Ansari May 06 '12 at 20:40

You're probably specifying the characters/strings you want replaced via string literals in the PHP source code? If so, the value of those string literals depends on the encoding you save your PHP file in. So while you see the character í, the literal's bytes might be an ISO-8859-1 í, or a Windows CP1252 í, or a UTF-8 í, or maybe even a UTF-32 í. I don't know offhand how many of those differ, but I know at least some have different byte representations, and so won't match in a PHP string comparison.

My point is, you need to specify the character in whatever encoding your incoming text is in.

Here's an example without using literals:

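$iso8859_1 = chr(237);         // í is byte 0xED (237) in ISO-8859-1; 236 would be ì
$utf8 = utf8_encode(chr(237)); // the same character as the UTF-8 bytes 0xC3 0xAD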
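Along the same lines, one way to keep the search string independent of the source file's encoding is to spell it out as byte escapes. This is only a sketch: `$response` stands in for the cURL response body and is assumed here to be UTF-8.

```php
<?php
// Stand-in for the cURL response body, assumed here to be UTF-8.
$response = "Mar\xC3\xADa";

// Needles written as byte escapes, so the encoding of the .php file is irrelevant.
$needleUtf8   = "\xC3\xAD"; // í in UTF-8
$needleLatin1 = "\xED";     // í in ISO-8859-1

echo str_replace($needleUtf8, '&iacute;', $response), "\n"; // Mar&iacute;a

// The lone ISO-8859-1 byte does not occur in this UTF-8 text, so nothing changes.
var_dump(str_replace($needleLatin1, '&iacute;', $response) === $response); // bool(true)
```

str_replace() compares raw bytes, so the needle has to use the same encoding as the text being searched.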

Be warned: if you decide to change the file encoding to UTF-8, text editors may or may not convert the existing characters. I've seen editors do really bizarre things when changing the encoding. Start with a fresh file.

Also, just because the other server claims it's UTF-8 doesn't mean it really is.
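A quick way to test that claim, as a sketch: an empty pattern with the u modifier only matches if the subject is valid UTF-8. `looks_like_utf8()` is a hypothetical helper name, not a built-in.

```php
<?php
// Hypothetical helper: true if $bytes is a valid UTF-8 byte sequence.
// preg_match() returns false (not 0) when the subject is invalid under /u.
function looks_like_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(looks_like_utf8("Mar\xC3\xADa")); // bool(true):  í as UTF-8
var_dump(looks_like_utf8("Mar\xEDa"));     // bool(false): í as a lone ISO-8859-1 byte
```

Passing the check doesn't prove the text was meant as UTF-8 (plain ASCII passes too), but failing it rules UTF-8 out.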

goat
  • I'll test this out later as it looks like a good answer. I did however work out that I was over-complicating the problem. A better solution was just to `utf8_encode` the string, and then when `json_encode` is called on an array containing the string it doesn't break any more and can be passed successfully to my js app - no need to do any replacements. – wheresrhys May 08 '12 at 08:36