0

I use cURL to fetch content from another site, but the result appears to be converted automatically from UTF-8 to ISO 8859-1, and I don't know why. For example:

site: abc.com:

Cửa Hàng Chip Chip: Rộn ràng đón Giáng sinh với những vật phẩm trang trí Noel đầy màu sắc của CHIPCHIP GIFT SHOP

But when I fetch the content of that site with cURL, I get the following:

Cửa H&agrave;ng Chip Chip: Rộn r&agrave;ng đ&oacute;n Gi&aacute;ng sinh với những vật phẩm trang tr&iacute; Noel đầy m&agrave;u sắc của CHIPCHIP GIFT SHOP

So how do I convert it back to UTF-8?

Manse
Phi Tống
  • 3
    Those are [character entity references](http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references). – sarnold Nov 24 '11 at 08:17
  • 2
    That can't be ISO-8859-1; you can't express all of those accents in that codepage. It's probably already UTF-8, just with some character entities like sarnold mentions. – Michael Madsen Nov 24 '11 at 09:28

6 Answers

0

You can try this:

html_entity_decode($string)

See more here: html_entity_decode
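
As a minimal sketch of that suggestion, assuming the fetched markup contains named and numeric entities (the sample string below is illustrative, not the real page), passing the target charset explicitly keeps the result in UTF-8:

```php
<?php
// Illustrative input: a fragment as curl might return it, with entities intact.
$fetched = 'r&agrave;ng &#273;&oacute;n Gi&aacute;ng sinh';

// Decode named and numeric entities into UTF-8 characters.
// ENT_QUOTES also decodes quote entities; ENT_HTML5 selects the HTML5 entity table.
$decoded = html_entity_decode($fetched, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL;
```

The third argument matters mostly on older PHP versions, where the default charset was not UTF-8.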

סטנלי גרונן
0

Your files aren’t being converted to another encoding. They’re using HTML character entities. You need to convert those entities, such as `&eacute;`, to the corresponding UTF-8 characters, such as é. This takes one extra line of code after you convert to UTF-8, if you even need to do that.

Davislor
0

I'd recommend using iconv.

`iconv --list` gives you a list of all known encodings, and you can then use `iconv -f FROM_ENCODING -t TO_ENCODING` to do your conversion. It can also read from stdin, so you can pipe the output of curl into it.

But regarding the comment you got on your question: it seems the page author didn't care about using a proper encoding and decided to stick with (old-style?) entities such as `&auml;`.

wal-o-mat
  • I tried using iconv, but I always get this error: `Notice: iconv(): Detected an illegal character in input string in D:\UniServer\www\deal\haha.php on line 5 C` This is my code: `echo iconv("UTF-8", "ISO-8859-1", $text);` – Phi Tống Nov 24 '11 at 09:17
  • Sorry, partially my fault. You did not specify that you're working with PHP except for the php tag, so I assumed you were working in the shell. – wal-o-mat Nov 24 '11 at 09:44
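
The notice quoted in the comment above is what one would expect if `$text` is already UTF-8: that call converts from UTF-8 to ISO-8859-1, and a character such as `ộ` has no ISO-8859-1 equivalent. A small sketch, using a made-up sample string and assuming the script file itself is saved as UTF-8, that checks the encoding before converting anything:

```php
<?php
// Made-up sample; in practice this would be the body returned by curl.
$text = 'Rộn ràng';

// preg_match() with the /u modifier fails on invalid UTF-8,
// so a result of 1 means the string is already valid UTF-8.
$isUtf8 = preg_match('//u', $text) === 1;

// 'ộ' (U+1ED9) does not exist in ISO-8859-1, so this conversion cannot succeed cleanly.
// With glibc iconv the call raises the notice and returns false;
// other iconv implementations may behave differently.
$latin1 = @iconv('UTF-8', 'ISO-8859-1', $text);

var_dump($isUtf8, $latin1);
```

If the check reports valid UTF-8, no charset conversion is needed and only the entities remain to be decoded.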
0

Put your string in a variable and use the following function.

$var = "";
echo utf8_encode($var);
0

Judging from the line you pasted, the problem appears to be with HTML entities, not with character encoding. The encoded chars look fine to me.

You need to translate those HTML entities to encoded chars. Which tool to use will depend on your environment or programming language. I don't think it can be done with cURL alone.

PHP has htmlspecialchars_decode(). Python has unescape() from the HTMLParser module (html.unescape() in Python 3).
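
One caveat worth checking against the PHP manual: `htmlspecialchars_decode()` only reverses the handful of entities that `htmlspecialchars()` produces (`&amp;`, `&quot;`, `&lt;`, `&gt;` and, depending on flags, the single quote), so named entities such as `&aacute;` pass through unchanged. A short sketch with an illustrative string shows the difference from `html_entity_decode()`:

```php
<?php
// Illustrative fragment mixing a special-character entity and a named accent entity.
$fragment = 'Gi&aacute;ng sinh &amp; Noel';

// Only &amp; is decoded here; &aacute; is left untouched.
$special = htmlspecialchars_decode($fragment);

// Both entities are decoded into UTF-8 characters.
$all = html_entity_decode($fragment, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $special, PHP_EOL;
echo $all, PHP_EOL;
```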

Community
AJJ
  • I forgot to tell you: it displays fine for me, but when I convert it to ASCII to make an SEO URL, I have a problem. Expected: `Rộn ràng đón Giáng sinh => ron rang don giang sinh (true)` What I get: `Rộn ràng đón Giáng sinh => Rộn ragraveng đoacuten Giaacuteng sinh (wrong)` – Phi Tống Nov 24 '11 at 09:28
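
For the SEO-URL case described in this comment, one possible approach is to decode the entities first and only then transliterate to ASCII. The sketch below assumes the `intl` extension is loaded (for `transliterator_transliterate()`); `make_slug` is a hypothetical helper name, and the input string is illustrative:

```php
<?php
// Hypothetical helper: entity-encoded HTML text -> ASCII URL slug.
// Assumption: the intl extension is available.
function make_slug(string $html): string
{
    // 1. Turn entities such as &agrave; into real UTF-8 characters.
    $text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 2. Strip accents and lowercase using ICU transliteration rules.
    $ascii = (string) transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', $text);

    // 3. Collapse every run of characters outside a-z and 0-9 into a single hyphen.
    return trim(preg_replace('/[^a-z0-9]+/', '-', $ascii), '-');
}

echo make_slug('Rộn r&agrave;ng đ&oacute;n Gi&aacute;ng sinh'), PHP_EOL;
```

Decoding before transliterating is the point of the design: otherwise the entity names themselves (`agrave`, `oacute`) end up in the slug, as in the "wrong" output above.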
0

curl does not convert anything; it downloads things "as is".

What you see are character entities, which are valid HTML, and it is the browser that does the conversion to a readable form.

You can check this by opening the file saved by curl in a browser. It will look like the live page.

Mihai Nita