CURL import character encoding problem

Question

I'm using cURL to import some code. However, in French, all the accented characters come out garbled. For example: BonjourÂ ...

I don't have access to change anything in the imported code. Is there anything I can do on my side to fix this?

Thanks

"Your situation is unclear. Where does PHP come in? Is the content you're downloading PHP code? What are you using to view the text afterwards?" -from Jon Skeet's answer below — David J., Jan 03 '11 at 01:05
Your situation is unclear. Where does PHP come in? Is the content you're downloading PHP code? What are you using to view the text afterwards? It's almost certainly just a case of handling the downloaded data in the appropriate encoding. However, you'll need to know what encoding that is (look at the HTTP headers for a possible hint, although it may not have been set correctly) and how to *use* the right encoding. We can't help you on the latter point until we know what you're doing with the data after fetching it. — Jon Skeet, Mar 16 '09 at 07:47

score 14 · Accepted Answer · answered Mar 16 '09 at 10:22

As Jon Skeet pointed out, it's difficult to understand your situation. However, if you only have access to the final text, you can try using iconv to change the text encoding.

For example:

$text = iconv("Windows-1252","UTF-8",$text);

I had a similar issue some time ago (with Italian and its special characters) and solved it this way.

Try different combinations of source and target charset (UTF-8, ISO-8859-1, Windows-1252).
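
One caveat when trying combinations: if the text is already valid UTF-8, converting it from Windows-1252 a second time produces double-encoded output (é turns into Ã©). A minimal sketch of a guard against that, assuming the mbstring and iconv extensions are loaded. The helper name to_utf8 and the Windows-1252 fallback are illustrative assumptions, not part of the original answer:

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the input is not already valid UTF-8.
// Assumption: anything that is not valid UTF-8 is encoded in $fallback (Windows-1252 here).
function to_utf8(string $text, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }
    $converted = iconv($fallback, 'UTF-8', $text);
    // iconv() returns false on failure; keep the original bytes in that case
    return $converted === false ? $text : $converted;
}

echo to_utf8("caf\xE9"), "\n";     // Windows-1252 bytes in, UTF-8 out
echo to_utf8("caf\xC3\xA9"), "\n"; // UTF-8 bytes in, returned unchanged
```

The check is only a heuristic: short Windows-1252 strings can happen to be valid UTF-8, so knowing the real source encoding is still the more reliable route.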

score 7 · Answer 2 · answered Aug 15 '13 at 04:44

I had a similar problem. I tried to loop through all combinations of input and output charsets. Nothing helped! :(

However, I was able to access the code that actually fetched the data, and that is where the culprit lay. The data was fetched via cURL. Adding

 curl_setopt($ch,CURLOPT_BINARYTRANSFER,true);

fixed it.

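A handy snippet for trying out every combination from a list of charsets:

// Candidate charsets to try as both source and target
$charsets = array(
    "UTF-8",
    "ASCII",
    "Windows-1252",
    "ISO-8859-15",
    "ISO-8859-1",
    "ISO-8859-6",
    "CP1256"
);

foreach ($charsets as $from) {
    foreach ($charsets as $to) {
        // iconv() returns false when the input is not valid in $from;
        // @ silences the notice so the listing stays readable
        $result = @iconv($from, $to, $text_2_convert);
        echo "<h1>Combination $from to $to produces: </h1>"
            . ($result === false ? "(conversion failed)" : $result);
    }
}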

score 3 · Answer 3 · answered Apr 07 '09 at 10:02

PHP seems to use UTF-8 by default, so I found that the following works:

$text = iconv("UTF-8","Windows-1252",$text);

score 3 · Answer 4 · answered Mar 04 '12 at 20:03

You could replace your

$data = curl_exec($ch);

with

$data = utf8_decode(curl_exec($ch));

I had this same issue and it worked well for me.

Ben


IMPORTANT: when converting UTF-8 data that contains the euro sign, don't use the utf8_decode function. utf8_decode converts the data into the ISO-8859-1 charset, but ISO-8859-1 does not contain the euro sign, so the euro sign is converted into a question mark '?'. To convert UTF-8 data containing the euro sign properly, use iconv("UTF-8", "CP1252", $data) instead. – Thoman Apr 06 '12 at 16:58
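
The difference described in the comment above can be checked directly. A minimal sketch, assuming the mbstring and iconv extensions are loaded. Since utf8_decode is deprecated as of PHP 8.2, mb_convert_encoding to ISO-8859-1 stands in for it here, which is a substitution and not the exact call from the answer:

```php
<?php
$euro = "\xE2\x82\xAC"; // the euro sign encoded as UTF-8 (three bytes)

// ISO-8859-1 has no euro sign, so mbstring falls back to its
// default substitute character, '?'
$latin1 = mb_convert_encoding($euro, 'ISO-8859-1', 'UTF-8');

// Windows-1252 (CP1252) maps the euro sign to the single byte 0x80
$cp1252 = iconv('UTF-8', 'CP1252', $euro);

echo bin2hex($latin1), "\n"; // 3f
echo bin2hex($cp1252), "\n"; // 80
```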

score 2 · Answer 5 · edited Oct 21 '18 at 10:35

I'm currently suffering from a similar problem: I'm trying to write a simple HTML <title> importer via cURL. Here is what I've done so far:

1. Retrieve the HTML via cURL.
2. Check whether the response headers give any hint of the encoding via curl_getinfo(), and match it with a regex.
3. Parse the HTML to look at the content-type meta tag and the <title> tag (yes, I know the consequences).
4. Compare the charset from the header with the one from the meta tag and choose the meta one if they differ, because we know no one cares about their httpd configuration and there are a lot of dirty workarounds relying on the meta tag.
5. iconv() the string.
6. Wish every day that when someone does not follow the standards, $DEITY punishes him/her until the end of days, because it would save me the meta parsing.
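
The header check, the meta check, and the iconv() step above could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's actual code: the helper names find_charset and extract_title are hypothetical, $contentType stands for what curl_getinfo($ch, CURLINFO_CONTENT_TYPE) would return, and the regexes are deliberately naive.

```php
<?php
// Hypothetical helper: pull a charset name out of a Content-Type header value.
function find_charset(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*["\']?([\w-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: return the <title> text converted to UTF-8, or null if there is none.
function extract_title(string $html, ?string $contentType): ?string
{
    $headerCharset = find_charset($contentType ?? '');

    // Naive regex lookup of the charset declared in a <meta> tag
    $metaCharset = preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)
        ? strtoupper($m[1])
        : null;

    // Prefer the meta charset over the header; assume UTF-8 when neither is present
    $charset = $metaCharset ?? $headerCharset ?? 'UTF-8';

    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return null;
    }
    $title = trim($m[1]);
    if ($charset === 'UTF-8') {
        return $title;
    }
    // iconv() returns false for unknown charsets or invalid input; fall back to the raw title
    $converted = iconv($charset, 'UTF-8', $title);
    return $converted === false ? $title : $converted;
}

$html = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">"
      . "<title>Caf\xE9</title></head></html>";
echo extract_title($html, 'text/html; charset=UTF-8'), "\n";
```

In the sample, the header claims UTF-8 while the meta tag says ISO-8859-1, so the meta value wins and the title is converted from ISO-8859-1.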
