
On my website (a hobby project, not commercial) I have to manage around 8 different translations, which are provided by volunteers. I manage these translations in Google Sheets. These sheets are saved as CSV and read by PHP whenever it encounters a string in need of translation. So far, it works flawlessly! :D [ me happy ]

To update these sheets, I have written a PHP routine that collects all the strings-to-be-translated and puts them in a CSV. I then import this CSV into Google Sheets again.

This works beautifully, except for a couple of characters per language. For example, in Portuguese, the 'à' displays as the notorious lozenge-with-a-question-mark symbol:

[images: portuguese - wrong, portuguese - right]

Chinese also goes fine(!), except for one character:

[images: chinese - wrong, chinese - right]

I know there are many questions and answers on this subject here at Stack Overflow. After reading them, I discovered that my files were initially written out as "Western MAC OS encoding". I have now added a BOM, and TextWrangler does recognize the file as 'UTF 8', but warns me that the file is corrupt. Indeed, the suspicious characters don't display well in TextWrangler either.

I also see references to a function called 'iconv', but that doesn't seem to have any influence.

I have the feeling I'm missing a crucial step. Would you mind having a look at a piece of code and helping me further?

      // Write the translations to a CSV file

  $fp = fopen("languages/gsheet_$language.csv", 'w');
  // Write the UTF-8 BOM (note: the handle is $fp, not $f)
  fwrite($fp, pack("CCC",0xef,0xbb,0xbf));

  write_csv($fp, array('key','translation','notes'));

  foreach($rows as $cols){
    array_walk($cols, "convert");
    write_csv($fp,$cols);
  }

 fclose($fp);   


} 

function write_csv($fp,$row){
  foreach($row as $key => $value){
    // Double any embedded quotes so the field stays valid CSV
    $row[$key] = '"' . str_replace('"', '""', $value) . '"';
  }
  fwrite($fp,implode(",",$row));
  fwrite($fp,"\n");
}
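One way to avoid hand-rolled quoting is PHP's built-in `fputcsv`, combined with a check that every cell really is UTF-8 before it is written. The following is only a minimal sketch, not the code from the question: `to_utf8` and `write_translations` are hypothetical names, and the ISO-8859-1 fallback is an assumption (the real source encoding may be something else). Passing an empty escape string requires PHP 7.4 or later, and the `mb_*` functions need the mbstring extension.

```php
<?php
// Hypothetical helper: return $value as valid UTF-8.
// Assumption: anything that is not already valid UTF-8 is ISO-8859-1.
function to_utf8(string $value, string $fallback = 'ISO-8859-1'): string {
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid, leave the bytes untouched
    }
    return mb_convert_encoding($value, 'UTF-8', $fallback);
}

// Hypothetical writer: BOM, header row, then one CSV line per row.
function write_translations(string $path, array $rows): void {
    $fp = fopen($path, 'w');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    fwrite($fp, "\xEF\xBB\xBF"); // UTF-8 BOM, same bytes as pack("CCC", ...)
    fputcsv($fp, ['key', 'translation', 'notes'], ',', '"', '');
    foreach ($rows as $cols) {
        // fputcsv quotes fields and doubles embedded quotes where needed
        fputcsv($fp, array_map('to_utf8', $cols), ',', '"', '');
    }
    fclose($fp);
}
```

Converting only the cells that fail `mb_check_encoding` avoids double-encoding strings that were valid UTF-8 to begin with.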

The post UTF-8 all the way through is very informative, but has a strong focus on the database. However, I found the origin of the problem to be in the fgetcsv function: http://php.net/manual/en/function.fgetcsv.php#96049
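For the reading side, a sketch along these lines can help narrow down where characters get damaged. It is not necessarily what the linked manual note describes: `read_translations` is a hypothetical name, the file is assumed to have a header row with the key in the first column and the translation in the second, and the empty escape string requires PHP 7.4 or later.

```php
<?php
// Hypothetical reader: returns [key => translation] from a UTF-8 CSV file.
function read_translations(string $path): array {
    $fp = fopen($path, 'r');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path for reading");
    }
    // Skip a UTF-8 BOM if present; otherwise go back to the start.
    if (fread($fp, 3) !== "\xEF\xBB\xBF") {
        rewind($fp);
    }
    fgetcsv($fp, 0, ',', '"', ''); // discard the header row (assumed present)
    $translations = [];
    while (($row = fgetcsv($fp, 0, ',', '"', '')) !== false) {
        if (count($row) < 2) {
            continue; // blank or malformed line
        }
        [$key, $translation] = $row;
        // Fail loudly instead of displaying a replacement character later.
        if (!mb_check_encoding($translation, 'UTF-8')) {
            throw new UnexpectedValueException("Invalid UTF-8 for key '$key'");
        }
        $translations[$key] = $translation;
    }
    fclose($fp);
    return $translations;
}
```

If the exception fires right after reading, the file itself is at fault; if it does not, the damage happens in a later processing step.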

Ideogram
  • You can't just say "This is UTF-8" if it isn't! It's like changing the cover of a French novel and pretending it's in English... – Álvaro González Nov 06 '15 at 09:38
  • @ÁlvaroGonzález The moderator linked through to the [duplicate](http://stackoverflow.com/questions/279170/utf-8-all-the-way-through) question at the top of your question. – cfreear Nov 06 '15 at 09:38
  • Indeed he did! Thanks. Regarding your first comment: the original CSV from Google Sheets is reported by TextWrangler to be UTF-8. Or is this not sufficient? – Ideogram Nov 06 '15 at 09:41
  • A file is nothing but a bunch of zeros and ones. There isn't a 100% reliable machine algorithm to determine its encoding. Nothing can really replace opening the file with a given encoding and using human eyes to determine if it looks correct. – Álvaro González Nov 06 '15 at 09:57
  • I would like to ask for re-opening, because the answer I found isn't mentioned in the original answer. This comment from the PHP documentation gives the answer: php.net/manual/en/function.fgetcsv.php#96049 – Ideogram Nov 07 '15 at 06:55
  • Are you positively sure that your file is using `UTF-16`? That format uses a BOM character so any decent text editor would automatically recognise it as such and never as UTF-8 :-? – Álvaro González Nov 11 '15 at 09:48
  • Thanks for the follow-up! I've been able to solve the problem. Now that it's solved, I could even remove the BOM and it stayed UTF-8 (as reported by TextWrangler). The problem was caused by a reg-exp that 'amputated' some characters; it treated the strings as if they were not UTF-8. I removed the reg-exp and made a custom PHP function for the CSV-reading-and-writing and now everything works. – Ideogram Nov 11 '15 at 10:43
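The 'amputation' described in the last comment can be reproduced with a small sketch. The pattern below is hypothetical (the actual reg-exp is not shown in the question); it only illustrates how a PCRE pattern without the `u` modifier works on bytes rather than on UTF-8 characters.

```php
<?php
// "olá" written as explicit UTF-8 bytes: 'á' is the two bytes C3 A1.
$word = "ol\xC3\xA1";

// Without /u, "." matches a single byte, so only half of 'á' is removed.
$byteWise = preg_replace('/.$/', '', $word);

// With /u, "." matches a whole UTF-8 character.
$charWise = preg_replace('/.$/u', '', $word);

var_dump(mb_check_encoding($byteWise, 'UTF-8')); // bool(false): a lone C3 byte is left
var_dump($charWise === 'ol');                    // bool(true)
```

A leftover lead byte such as `C3` is exactly the kind of sequence that browsers and editors render as the question-mark lozenge.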
