utf-8 character set, 7bit encoding, PHP adding strange characters

Question

I'm sorry my title is not better, but I'm not even sure how to categorize this problem. I know this has to do with encoding, but I am not sure how.

I am working on a project for an ESP (email service provider). Their emails use 7-bit transfer encoding with a UTF-8 character set (which doesn't really make sense to me).

Exhibit A: [screenshot of the encoding settings; image not preserved]

I get the html email text via an API. I then use PHP to modify some of the text (via a str_replace), and then post the new html via the API.

All is fine, except every time I post, I am getting some strange characters, i.e. every time I run the code it adds another funky character.

Here is the affected section of the email before I make any changes (this is in "view" mode, i.e. how a browser would see it):

[screenshot of the affected section; image not preserved]

Here is the code that produces that copyright symbol AND the A with the circumflex above it:

&#194;&#169; 2012 H

What's strange is that the only way to get rid of that A with the circumflex above it is to delete the copyright symbol...somehow they are related.

Every time I post to the API via PHP, I get some new funky characters, thus:

[Three screenshots followed here, labeled 1st post, 2nd post and 3rd post; the images were not preserved.]

It's so strange...this is the only part that is not working! Please help...this is making me crazy! :-)

EDIT:

Here's the relevant PHP:

Get the html from an xml response:

$html = (string)$data;

Replace some stuff:

$newHTML = str_replace($oldExpiresString, $newExpiresString, $html);

Put the new HTML into the xml post variables:

$input = ''.$newHTML.'';

URLEncode it:

$formatted = urlencode($input);

Post via curl:

$postVariables = array(
    'type' => urlencode($type),
    'activity' => urlencode($activity),
    'input' => urlencode($input)
);

$rawResponseString = post_url($urlBase, $postVariables);
print $rawResponseString;
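A note on the urlencode step: the `post_url` helper is not shown, so how it builds the request body is unknown. If it accepts a ready-made body string, one way to make sure each value is percent-encoded exactly once is to leave the values raw and let `http_build_query()` do the encoding. The sketch below uses made-up values for `$type` and `$activity` and only builds the body; it does not call the API.

```php
<?php
// Hypothetical values standing in for the question's variables.
$type     = 'email';
$activity = 'update';
$input    = '&#194;&#169; 2012 H';

// Leave the values raw: http_build_query() percent-encodes each one once.
$body = http_build_query(
    array(
        'type'     => $type,
        'activity' => $activity,
        'input'    => $input,
    ),
    '',  // no numeric-key prefix
    '&'  // explicit separator, independent of php.ini settings
);

echo $body, "\n";
// type=email&activity=update&input=%26%23194%3B%26%23169%3B+2012+H
```

The `input` part matches the urlencoded string quoted in the comments, which suggests the encoding step itself is not what corrupts the copyright symbol.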

Is your PHP script in UTF-8 itself? If not you're passing a non-UTF8 character then expecting the server to understand what it is. — Alasdair, Feb 25 '13 at 06:38
Have you tried [`utf8_decode`](http://www.php.net/manual/en/function.utf8-decode.php)? — Jon, Feb 25 '13 at 06:41
Hey guys...I'm not doing anything with the php encoding wise...I will add the php...I'm not sure what I SHOULD do with the php for proper encoding/decoding. — richard, Feb 25 '13 at 06:41
Look at [`utf8_encode`](http://php.net/manual/en/function.utf8-encode.php) for how `utf8` works. Unless you specify utf8 output, your output isn't being converted to it, so it would need to be decoded. — Jon, Feb 25 '13 at 06:43
FYI I don't know much about encoding so you guys will have to help me understand what to do... — richard, Feb 25 '13 at 06:47
when you do `$html = (string)$data;`, try `$html = utf8_decode($data);` (Also, why are you using `urlencode` for a `curl` post operation?) — Jon, Feb 25 '13 at 06:49
Jon, for some reason the curl post fails when posting to the api if I don't urlencode. — richard, Feb 25 '13 at 06:53
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/25062/discussion-between-jon-and-richard-deslonde) — Jon, Feb 25 '13 at 06:53
@Jon Please stop recommending `utf8_decode` and read [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/) instead. — deceze, Feb 25 '13 at 07:35
@deceze I do know that, but the OP is getting a utf response when not expecting one, and at some point it is being converted with `htmlentities`. I don't believe that their php is set up to handle utf naturally. — Jon, Feb 25 '13 at 07:39
It looks like the raw data is already encoded into html entities. — richard, Feb 25 '13 at 07:41
The characters are not being added to the code in my php...I just checked every step. So it must be happening on the target system I am posting to...I must not be sending the right thing and the API is decoding it wrong and adding characters. — richard, Feb 25 '13 at 07:42
@Richard You'd probably also benefit from reading the aforelinked article. In any case though, it'd be helpful if you could post the data as it is during the various steps of your program. `echo bin2hex($string)` also helps to debug what encoding the text is *actually* in at any given time. — deceze, Feb 25 '13 at 07:43
Hi deceze...the characters don't change...my php code posts what it gets. I just checked that. Here's what I get: `&#194;&#169; 2012 H` — richard, Feb 25 '13 at 07:45
Here's what I post (it's urlencoded): %26%23194%3B%26%23169%3B+2012+H — richard, Feb 25 '13 at 07:46
If `&#194;&#169;` is supposed to represent "©", then it's a totally screwed up encoding. Two entities representing one character means the encoding was already messed up by the sender, likely a multi-byte string was HTML encoded as if it were in a single-byte encoding. So you now need to HTML-decode the text to a single byte encoding (likely ISO-Latin1) and then treat the string as if it were multi-byte encoded (likely UTF-8). — deceze, Feb 25 '13 at 07:50
Yes, it is screwed up. The extra character is PART of the copyright symbol. If I delete the copyright symbol, it gets rid of the extra character, but that's the ONLY way! — richard, Feb 25 '13 at 07:52
Can you help me (ie with some code :-) ) on how to do as you are suggesting? — richard, Feb 25 '13 at 07:52
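The `bin2hex` check suggested in the comments could look roughly like this. It is only a sketch: `$fromApi` is a made-up sample modeled on the string quoted above, not real API output. The point is to compare the bytes actually held in the string with the two bytes (C2 A9) that a correctly UTF-8 encoded copyright sign consists of.

```php
<?php
// Hypothetical sample, modeled on the entity text quoted in the comments.
$fromApi = '&#194;&#169; 2012 H';

// A correctly UTF-8 encoded copyright sign is the two bytes C2 A9.
$realCopyright = "\xC2\xA9";

// The API text is plain ASCII: it starts with "&#" (hex 26 23), not with C2 A9.
echo bin2hex($fromApi), "\n";

// For comparison, the bytes of the real character.
echo bin2hex($realCopyright), "\n"; // c2a9
```

Running the same check after each step (fetch, `str_replace`, `urlencode`) shows where, if anywhere, the bytes change.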

score 2 · Answer 1 · edited May 23 '17 at 12:19

To elaborate on my comment:

$screwed = '&#194;&#169;';

echo html_entity_decode($screwed, ENT_COMPAT, 'ISO-8859-1');

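This returns "©": the two entities are decoded to the single bytes C2 and A9, which together are the UTF-8 encoding of the copyright sign. In other words, it undoes the screwed up encoding in which each byte of the UTF-8 text was turned into its own HTML entity. From here you just need to treat the text as UTF-8 encoded (which it is now).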
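Applied to a whole HTML document, `html_entity_decode` would also decode entities such as `&amp;` and `&lt;`, which may not be wanted. A narrower variant, sketched below, only touches runs of numeric entities in the range 128 to 255 and keeps the decoded bytes only if they form valid UTF-8. The function name `repairDoubleEncodedEntities` is made up for this example, and the approach assumes the damage is always of the kind shown in the question.

```php
<?php
// Hypothetical helper; the name and the regex are illustration only.
function repairDoubleEncodedEntities(string $html): string
{
    // Runs of numeric entities with values 128-255, e.g. "&#194;&#169;".
    $pattern = '/(?:&#(?:12[89]|1[3-9][0-9]|2[0-4][0-9]|25[0-5]);)+/';

    return preg_replace_callback($pattern, function (array $m): string {
        // Turn each entity into one byte: "&#194;&#169;" becomes C2 A9.
        $bytes = html_entity_decode($m[0], ENT_COMPAT, 'ISO-8859-1');

        // preg_match with the /u flag only succeeds on valid UTF-8 input.
        // Keep the original entities if the bytes are not valid UTF-8.
        return preg_match('//u', $bytes) === 1 ? $bytes : $m[0];
    }, $html);
}

$fixed = repairDoubleEncodedEntities('&#194;&#169; 2012 H');
echo bin2hex($fixed), "\n"; // c2a920323031322048
```

A lone `&#169;` (a correctly encoded entity) is left untouched, because the single byte A9 is not valid UTF-8 on its own. The repaired string can then go through `str_replace` and be posted as UTF-8.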

answered Feb 25 '13 at 08:04

deceze

Thanks, I was trying the decodes to figure out when it was screwed, I hadn't realized at first that it was the same throughout. – Jon Feb 25 '13 at 08:07
Ok I will try this and let you know! – richard Feb 25 '13 at 08:33
