I can't figure out what I'm doing wrong. I'm reading file content from the database. When I echo the content, everything displays fine, but when I write it to an .html file, the text comes out garbled. I've tried iconv and a few other solutions, but I don't understand what to pass as the first parameter (the input charset); leaving it blank didn't work either. I assume the content comes out of the DB as UTF-8, since it echoes properly. I've been stuck on this for a while.

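// Note: as a global function this name collides with PHP's built-in file(); it works only as a class method.
function file($fileName, $content) {
    // Build the path once so the existence check and fopen() refer to the same file.
    $path = DOCROOT . "out/" . $fileName;
    if (file_exists($path)) {
        return FALSE;
    }
    $file_handle = fopen($path, "wb") or die("can't open file");
    // iconv() from UTF-8 to UTF-8 leaves valid UTF-8 input unchanged.
    fwrite($file_handle, iconv('UTF-8', 'UTF-8', $content));
    fclose($file_handle);
    return TRUE;
}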

Here is what the source of the HTML file looks like.

The content comes out of the DB like this:

<h5>Текущая стабильная версия CMS</h5>

It goes into the file like this:

<h5>Ð¢ÐµÐºÑƒÑ‰Ð°Ñ ÑÑ‚Ð°Ð±Ð¸Ð»ÑŒÐ½Ð°Ñ Ð²ÐµÑ€ÑÐ¸Ñ CMS</h5>

EDIT:

Turns out the root of the problem was Apache serving the files incorrectly. Adding

AddDefaultCharset utf-8

to my .htaccess file fixed it. Hours wasted... At least I learned something, though.
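A complementary approach, sketched here under assumptions, is to declare the encoding inside the generated file itself, as suggested in the comments. The helper name `wrap_utf8_html` is hypothetical and not part of the original code; the sketch assumes the stored content is an HTML fragment rather than a full document, and that the mbstring extension is available. Note that a charset in the HTTP `Content-Type` header takes precedence over the meta tag, so this only helps when the server sends no charset.

```php
<?php
// Hypothetical helper (not from the original post): wrap a UTF-8 HTML
// fragment in a minimal document that declares its own encoding.
function wrap_utf8_html(string $fragment, string $title = ''): string
{
    // Refuse input that is not valid UTF-8 instead of writing broken bytes.
    if (!mb_check_encoding($fragment, 'UTF-8')) {
        throw new InvalidArgumentException('Content is not valid UTF-8');
    }

    return "<!DOCTYPE html>\n<html>\n<head>\n"
        . "<meta charset=\"utf-8\">\n"
        . "<title>" . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . "</title>\n"
        . "</head>\n<body>\n"
        . $fragment
        . "\n</body>\n</html>\n";
}

// Usage: the fragment from the question, wrapped before it is written to disk.
$page = wrap_utf8_html('<h5>Текущая стабильная версия CMS</h5>', 'CMS');
echo $page;
```

The meta tag is placed first inside the head so that browsers see the declaration before any non-ASCII text.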

Serhiy
  • Post your output HTML somewhere. Chances are you have not added the UTF-8 charset meta tag to the HTML head. – Dean Nov 19 '15 at 22:01
  • @Dean would that matter even when looking at the source? Ðа форуме чаÑто упоминаетÑÑ ÐºÐ¾Ð´Ð¾Ð³ÐµÐ½ÐµÑ€Ð°Ñ‚Ð¾Ñ€. – Serhiy Nov 19 '15 at 22:04
  • @Serhiy can you add the code in your comment at the bottom of your original post? comment codes/scramble is hard enough to read at the best of times, cheers – Martin Nov 19 '15 at 22:36
  • can you post a couple of comparison code blocks showing how the text looks in your output from the database versus how it looks in your dump into the file? – Martin Nov 19 '15 at 22:45
  • I bet that Ð¢ÐµÐºÑƒÑ‰Ð°Ñ ÑÑ‚Ð°Ð±Ð¸Ð»ÑŒÐ½Ð°Ñ Ð²ÐµÑ€ÑÐ¸Ñ is the UTF-8 of Текущая стабильная версия seen as ISO-8859-1, also because the character Ð is 0xD0 in ISO-8859-1, and 0xD0 is the first of the 2 UTF-8 bytes of the letters of the cyrillic alphabet. This means that you ARE writing UTF-8, but are looking at it as if it were ISO-8859-1 (or ISO-8859-15). – Walter Tross Nov 19 '15 at 23:38
  • ahhhh, I see your edit, ok - so then it's the character encoding of the source code reader that needs to be corrected... – Martin Nov 19 '15 at 23:40
  • @Walter Tross my god, I think we're getting close. When I dive into PuTTY and cat the file, it looks OK, so is it an Apache issue? And where would I even begin with that... – Serhiy Nov 19 '15 at 23:48
  • @Martin this just gets better and better, in chrome everything appears fine, in IE and firefox it's scrambled. – Serhiy Nov 19 '15 at 23:50
  • Man ohh man was this ever frustrating. @WalterTross thank you for pointing me in the right direction. Martin thank you so much for all the debugging help and helping me eliminate all the possible problems. I feel like I should be paying you, you offered me so much help. – Serhiy Nov 19 '15 at 23:55
  • did you get a solution? - I have no immediate ideas about firefox (etc) scrambling the encoding.... maybe revert back to what Dean said and add an encoding type header. – Martin Nov 19 '15 at 23:55
  • @Martin yep, just put it in the edit, thank you so much. – Serhiy Nov 19 '15 at 23:56
  • @Martin if you want to integrate it into your answer, I will accept it, since you've been so helpful. – Serhiy Nov 19 '15 at 23:56
  • ahhhh `.htaccess` . Really pleased to get a solution to this, it was confusing me. I don't want to take the credit, you discovered this yourself. Just mark up (+1) my answer ;-) – Martin Nov 19 '15 at 23:58
  • Does the `.htaccess` issue also correct the FF / IE browsers, too? – Martin Nov 19 '15 at 23:59
  • @Martin yep, all 3 browsers work perfectly now – Serhiy Nov 20 '15 at 00:02
  • awesome! As I say, it's satisfying as a helper trying to work out what's going on to discover (be told) the solution. And just in time, now I need to disappear! Finally, though, what made you think of adding that line to .htaccess? – Martin Nov 20 '15 at 00:04
  • @Martin The Walter Tross comment you pointed me towards. He said it's the reader. So after I output the file via cat, I got good results, and figured my reader, aka Apache, was at fault. Next thing I did was google Apache and UTF-8 and got http://stackoverflow.com/questions/913869/how-to-change-the-default-encoding-to-utf-8-for-server – Serhiy Nov 20 '15 at 00:16
  • @Martin Should this be of use, finished the thing I was doing https://github.com/3rdcupofjava/scraping_old_kohana_forum all the code is actually in this file https://github.com/3rdcupofjava/scraping_old_kohana_forum/blob/master/application/classes/Controller/Main.php – Serhiy Nov 20 '15 at 02:19
  • Glad I could help. I have to correct my comment though: Characters in the cyrillic block of Unicode start with one of 0xD0, 0xD1, 0xD2, 0xD3 when encoded as UTF-8 (because they are in the range U+0400 to U+04FF). These 4 start bytes appear as Ð, Ñ, Ò and Ó in ISO-8859-1 (with decreasing frequency). – Walter Tross Nov 20 '15 at 07:16
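The diagnosis in the comments above can be reproduced with a short sketch. The helper names (`misread_as_latin1`, `undo_misread`, `lead_byte_hex`) are hypothetical and exist only for illustration; the sketch assumes the mbstring extension is available and that the script itself is saved as UTF-8. It uses strict ISO-8859-1, whereas browsers typically treat that label as Windows-1252, so some characters may differ from what a browser shows.

```php
<?php
// Hypothetical helper: re-interpret UTF-8 bytes as ISO-8859-1, the way a
// reader with the wrong charset would, and return the result as UTF-8.
function misread_as_latin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical helper: reverse the misreading to recover the original bytes.
function undo_misread(string $mojibake): string
{
    return mb_convert_encoding($mojibake, 'ISO-8859-1', 'UTF-8');
}

// Hypothetical helper: first (lead) byte of a UTF-8 encoded character, as hex.
function lead_byte_hex(string $utf8Char): string
{
    return sprintf('0x%02X', ord($utf8Char[0]));
}

$original = 'Текущая';
$garbled  = misread_as_latin1($original);

echo lead_byte_hex('Т'), "\n";                     // 0xD0
echo mb_substr($garbled, 0, 2, 'UTF-8'), "\n";     // Ð¢
var_dump(undo_misread($garbled) === $original);    // bool(true)
```

Because the round trip restores the original string, the bytes on disk were valid UTF-8 all along; only the declared encoding at read time was wrong.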