php text encoding when GETting a webpage and then POSTing contents

Question

I'm trying to GET a webpage, parse part of it, and then POST that part as a value. The problem is that when the page contains a character such as ó, I retrieve Ã³ instead, so when I post it, urlencode converts those characters to something completely different, which doesn't work.

More precisely, Ã³ is what you get when an ó encoded in UTF-8 is interpreted as if it were ISO-8859-1, or at least that's what my browser does: if I set it to view the page as UTF-8 I see ó, and if I set it to view the page as ISO-8859-1 I see Ã³. Other encodings produce different symbols.
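For illustration, the misreading described above can be reproduced in a few lines. This is only a sketch: it assumes the mbstring extension is loaded and PHP 7 or later (for the `\u{...}` escape), and the variable names are made up for the example.

```php
<?php
// "ó" encoded as UTF-8 is the two bytes 0xC3 0xB3.
$utf8 = "\u{00F3}";

// Treat those two bytes as ISO-8859-1 characters and re-encode them as UTF-8.
// This is the same misreading that turns "ó" into "Ã³".
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";        // c3b3
echo urlencode($utf8), "\n";      // %C3%B3
echo urlencode($mojibake), "\n";  // %C3%83%C2%B3
```

The last line shows why the POST fails: once the text has been misread, urlencode faithfully encodes four bytes instead of the original two.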

I tried converting the whole page, and also that specific string, to UTF-8. I also tried setting the request headers to accept only UTF-8, but that didn't work either. I'm quite certain that the encoding is the problem, but I'm running out of ideas. I changed the configuration in php.ini, but I may not have restarted yet. Basically this is like shooting in the dark, and some help would be greatly appreciated.

If this helps: The specific code is here: https://github.com/trylks/golem/blob/master/php/copperGolem.php

The relevant method is "form"; the problem appears when it takes one of the parameter values from a page previously obtained with GET.

Thank you.

PS (solved): I've been working on this for the last few hours, so I can't tell whether any of the other things I changed are also necessary. In any case, the last change, the one that made it work, was changing line 60 to this: `$dom->loadHTML(mb_convert_encoding($p, 'html-entities', mb_detect_encoding($p)));` The problem is not libcurl but DomDocument, as explained here: PHP DomDocument failing to handle utf-8 characters (☆)

Does this help? [Handling Unicode Front To Back In A Web App](http://kunststube.net/frontback/) — deceze, Mar 29 '13 at 23:17
See: http://stackoverflow.com/questions/649480/curl-import-character-encoding-problem — Will B., Mar 29 '13 at 23:17
@deceze updated to proper URL, odd browser copied wrong link >> — Will B., Mar 29 '13 at 23:20
You said you tried to convert the page and the specific string to UTF-8 and force UTF-8 encoding via headers. Can you show how you did this? — George Reith, Mar 29 '13 at 23:23
Headers are in line 20, and the conversion in line. I've tried to use `mb_convert_encoding` with the retrieved page and also with the parameters before `urlencode`. I think I tried with UTF-8 and ISO-8859-1, but I'm going to double check before trying to install something additional like `iconv`. — Trylks, Mar 30 '13 at 01:16
* and the conversion in line 38 and 59, I'm still trying to figure out what's wrong, so I've used that in some other places, none of them worked. — Trylks, Mar 30 '13 at 01:30

score 0 · Accepted Answer · edited May 23 '17 at 12:28

The problem is in DomDocument: it doesn't handle UTF-8 input properly. As far as I can tell, loadHTML treats the input as ISO-8859-1 unless the markup itself declares another charset. Converting to HTML entities is the safest option, and it works like magic when outputting these characters back with echo (even on the CLI) or when urlencoding them. Basically, DomDocument doesn't accept UTF-8 but it outputs UTF-8, or so it seems. So an odd-looking conversion has to be made first, which DomDocument then undoes, and everything is back to normal again.

To do this, where $dom is a DomDocument, it is enough to replace every call to $dom->loadHTML($p) with:

$dom->loadHTML(mb_convert_encoding($p, 'html-entities', mb_detect_encoding($p)));

This is explained better in this other question: PHP DomDocument failing to handle utf-8 characters (☆)
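As a rough, self-contained illustration of the conversion above, the sketch below loads the same markup with and without it. The sample page and variable names are made up for the example, and the source encoding is passed explicitly as 'UTF-8' instead of relying on mb_detect_encoding, which is an assumption about the input. It needs the dom and mbstring extensions and PHP 7 or later.

```php
<?php
// Hypothetical sample page: UTF-8 bytes, no charset declared in the markup.
$p = "<html><body><p>canci\u{00F3}n</p></body></html>";

// Plain load: the parser has to guess the encoding of the raw bytes,
// so the result here may come out as mojibake depending on the libxml version.
$naive = new DOMDocument();
$naive->loadHTML($p);
$naiveText = $naive->getElementsByTagName('p')->item(0)->textContent;

// Converted load: non-ASCII characters become entities such as &oacute;,
// so the markup handed to the parser is plain ASCII and nothing is guessed.
$fixed = new DOMDocument();
$fixed->loadHTML(mb_convert_encoding($p, 'HTML-ENTITIES', 'UTF-8'));
$fixedText = $fixed->getElementsByTagName('p')->item(0)->textContent;

echo urlencode($naiveText), "\n";
echo urlencode($fixedText), "\n"; // canci%C3%B3n
```

One caveat, to the best of my knowledge: PHP 8.2 and later report the 'HTML-ENTITIES' target of mb_convert_encoding as deprecated, so on recent versions this call still works but emits a deprecation notice.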
