5

I'm using cURL to import some code. However, in French, all the accented characters come out garbled. For example: Bonjour ...

I don't have access to change anything on the imported code. Is there anything I can do on my side to fix this?

Thanks

David J.
  • "Your situation is unclear. Where does PHP come in? Is the content you're downloading PHP code? What are you using to view the text afterwards?" -from Jon Skeet's answer below – David J. Jan 03 '11 at 01:05
  • Your situation is unclear. Where does PHP come in? Is the content you're downloading PHP code? What are you using to view the text afterwards? It's almost certainly just a case of handling the downloaded data in the appropriate encoding. However, you'll need to know what encoding that is (look at the HTTP headers for a possible hint, although it may not have been set correctly) and how to *use* the right encoding. We can't help you on the latter point until we know what you're doing with the data after fetching it. – Jon Skeet Mar 16 '09 at 07:47

5 Answers

14

As Jon Skeet pointed out, it's difficult to understand your situation. However, if you only have access to the final text, you can try using iconv to change the text encoding.

For example:

$text = iconv("Windows-1252","UTF-8",$text);

I had a similar issue some time ago (with Italian and its special characters) and solved it this way.

Try different combinations of source and target encodings (UTF-8, ISO-8859-1, Windows-1252).
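
To make the idea above concrete, here is a minimal sketch, not the answerer's exact code. It assumes the mbstring and iconv extensions are loaded, and that any text which is not already valid UTF-8 is really Windows-1252; that fallback is only a guess, and the helper name to_utf8 is made up for illustration.

```php
<?php
// Hypothetical helper: return $text as UTF-8, assuming that any input which
// is not already valid UTF-8 is encoded in $fallback (a guess, not a fact).
function to_utf8(string $text, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }
    $converted = iconv($fallback, 'UTF-8', $text);
    // iconv() returns false on failure; fall back to the original bytes then.
    return $converted === false ? $text : $converted;
}

// "ç" stored as the single Windows-1252 byte 0xE7
echo to_utf8("Fran\xE7ais"), "\n";
```

Checking for valid UTF-8 first avoids converting text twice, which is a common cause of output like "Ã©" instead of "é".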

Alekc
7

I had a similar problem. I tried to loop through all combinations of input and output charsets. Nothing helped! :(

However, I was able to access the code that actually fetched the data, and that is where the culprit lay. The data was fetched via cURL. Adding

 curl_setopt($ch,CURLOPT_BINARYTRANSFER,true);

fixed it.

A handy snippet to try out all possible combinations from a list of charsets:

$charsets = array(  
        "UTF-8", 
        "ASCII", 
        "Windows-1252", 
        "ISO-8859-15", 
        "ISO-8859-1", 
        "ISO-8859-6", 
        "CP1256"
        ); 

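// Try every source/target pair and print the result for visual inspection.
// $text_2_convert holds the text you want to test.
foreach ($charsets as $ch1) {
    foreach ($charsets as $ch2) {
        // iconv() returns false (and raises a notice) when the conversion fails.
        $converted = @iconv($ch1, $ch2, $text_2_convert);
        echo "<h1>Combination $ch1 to $ch2 produces: </h1>"
            . ($converted === false ? "(conversion failed)" : $converted);
    }
}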
Rid Iculous
3

PHP seems to use UTF-8 by default, so I found that the following works:

$text = iconv("UTF-8","Windows-1252",$text);

3

You could replace your

$data = curl_exec($ch);

with

$data = utf8_decode(curl_exec($ch));

I had this same issue and it worked well for me.

Ben
    IMPORTANT: when converting UTF-8 data that contains the euro sign, DON'T USE the utf8_decode function. utf8_decode converts the data into the ISO-8859-1 charset, but ISO-8859-1 does not contain the euro sign, so the euro sign will be converted into a question mark character '?'. To properly convert UTF-8 data containing the euro sign you must use: iconv("UTF-8", "CP1252", $data) – Thoman Apr 06 '12 at 16:58
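
A small sketch of the difference described in the comment above, assuming the iconv and mbstring extensions are loaded. It uses mb_convert_encoding() for the ISO-8859-1 case rather than utf8_decode(), because utf8_decode() is deprecated as of PHP 8.2; with the default mbstring settings the result for the euro sign is the same "?".

```php
<?php
$euro = "\xE2\x82\xAC"; // "€" encoded as UTF-8

// Windows-1252 (CP1252) has the euro sign at byte 0x80.
$cp1252 = iconv('UTF-8', 'CP1252', $euro);
echo bin2hex($cp1252), "\n"; // 80

// ISO-8859-1 has no euro sign, so mbstring substitutes "?" by default.
$latin1 = mb_convert_encoding($euro, 'ISO-8859-1', 'UTF-8');
echo $latin1, "\n"; // ?
```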
2

I'm currently dealing with a similar problem: I'm trying to write a simple HTML <title> importer via cURL. Here is an outline of what I've done so far:

  1. Retrieve the HTML via cURL
  2. Check if there's any hint of encoding on the response headers via curl_getinfo() and match it via regex
  3. Parse the HTML to look at the content-type meta and the <title> tag (yes, I know the consequences)
  4. Compare the two content-type values, from the header and from the meta tag, and choose the meta one if they differ, because we know no one cares about their httpd configuration and there are a lot of dirty workarounds relying on the meta tag
  5. iconv() the string
  6. Wish every day that when someone does not follow the standards, $DEITY punishes him/her until the end of days, because it would save me the meta parsing
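
Steps 2 to 5 above could be sketched roughly like this. It is only an illustration under a few assumptions: the function names are made up, $html is the raw body returned by curl_exec(), $header is the value of curl_getinfo($ch, CURLINFO_CONTENT_TYPE), the HTML is scanned with rough regexes rather than a real parser, and UTF-8 is assumed when neither the header nor the meta tag names a charset.

```php
<?php
// Hypothetical helper: pull "charset=..." out of a Content-Type value.
function charset_from(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: return the page title converted to UTF-8, or null.
function title_as_utf8(string $html, ?string $header): ?string
{
    $fromHeader = charset_from($header);          // step 2: response header
    $fromMeta = null;                             // step 3: rough meta scan
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        $fromMeta = strtoupper($m[1]);
    }
    // Step 4: prefer the meta value; assume UTF-8 when nothing is declared.
    $charset = $fromMeta ?? $fromHeader ?? 'UTF-8';

    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return null;                              // no <title> found
    }
    $title = trim($m[1]);
    if ($charset === 'UTF-8') {
        return $title;
    }
    $converted = iconv($charset, 'UTF-8', $title); // step 5
    return $converted === false ? $title : $converted;
}

// The header claims UTF-8, but the meta tag (and the bytes) say ISO-8859-1.
$html = "<html><head><meta charset=\"ISO-8859-1\"><title>Caf\xE9</title></head></html>";
echo title_as_utf8($html, 'text/html; charset=UTF-8'), "\n"; // Café
```

Keeping the charset detection separate from the fetch means it can be tried on saved pages without making any requests.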
Cœur
rmontagud