I need to get the content of a remote file in UTF-8 encoding. The file is in UTF-8. When I open that file in the browser, the characters are displayed correctly:

http://www.parfumeriafox.sk/source_file.html

(Note the ň and č characters, for example; they are displayed correctly.)

When I run this code:

<?php

$url = 'http://parfumeriafox.sk/source_file.html';

$csv = file_get_contents_utf8($url);
header('Content-type: text/html; charset=utf-8');
print $csv;

function file_get_contents_utf8($fn) {
  $content = file_get_contents($fn);
  return mb_convert_encoding($content, 'utf-8');
}

(you can run it using http://www.parfumeriafox.sk/encoding.php), then I get question marks instead of those special characters. I have researched this extensively: I have tried the plain file_get_contents function, I have tried passing a PHP stream context, and I have also tried fopen and fread to read the file at the binary level. Nothing seems to work. I have tried it with and without sending the header. This is supposed to be perfectly simple, so what am I doing wrong? When I check the string with an encoding detection function, it returns UTF-8.

Funk Forty Niner
petiar

2 Answers

You can see which character set your browser decided the document uses by opening the developer console and inspecting document.characterSet:

> document.characterSet
"windows-1250"

With this knowledge we can ask iconv to convert from "windows-1250" to utf-8 for us:

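<?php
// Read the raw bytes (here from a local copy of the file).
$text = file_get_contents("source_file.csv");
// Re-encode the bytes from Windows-1250 to UTF-8.
$text = iconv("windows-1250", "utf-8", $text);
print($text);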

The output is valid utf-8, and levanduľa is displayed correctly as well.
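
A possible explanation for the question marks, offered as a sketch rather than a confirmed diagnosis: when mb_convert_encoding() is called without a source encoding, mbstring assumes its default (normally UTF-8), so any byte that is not valid UTF-8 is replaced by a substitute character. The helper name to_utf8 and the Windows-1250 fallback below are assumptions made for illustration; the sample bytes stand in for the remote file.

```php
<?php
// Hypothetical helper: return $bytes as UTF-8, converting from $fallback
// only when the input is not already valid UTF-8.
function to_utf8(string $bytes, string $fallback = 'WINDOWS-1250'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, leave untouched
    }
    $converted = iconv($fallback, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert input from $fallback");
    }
    return $converted;
}

// "levanduľa" as Windows-1250 bytes: ľ is the single byte 0xBE.
$raw = "levandu\xBEa";

// The call from the question: no source encoding is given, so mbstring
// treats the input as its default encoding and substitutes invalid bytes.
$naive = mb_convert_encoding($raw, 'UTF-8');

// Explicit conversion from the assumed real encoding.
$fixed = to_utf8($raw);

echo bin2hex($naive), "\n"; // hex dump makes the substituted byte visible
echo $fixed, "\n";
```

The validity check is only a heuristic: a short Windows-1250 string can happen to be valid UTF-8, so knowing the real source encoding is more reliable than guessing.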

MatsLindh

How about this one?

For this one I used header('Content-Type: text/plain; charset=Windows-1250');

bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn


[image]


This code works for me

<?php
header('Content-Type: text/plain;charset=Windows-1250');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>


The problem is not with file_get_contents()

I saved $data to a file. The bytes were correct, but my text editor still did not decode them correctly. See the image below.

$data = file_get_contents('http://www.parfumeriafox.sk/source_file.html');
file_put_contents('doc.txt',$data);

UPDATE

There seems to be one problematic character, as shown here. It can also be seen in the HTML screenshot below, where it renders as ¾.

Its hex value is 0xBE (190 decimal).

I tried these two character sets. Neither worked.

header('Content-Type: text/plain; charset=ISO 8859-1');
header('Content-Type: text/plain; charset=ISO 8859-2');



[image]


END OF UPDATE
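
To see why byte 0xBE is a useful clue, it can be decoded under each candidate charset. This is a minimal sketch; decode_byte is a hypothetical helper, and it uses the hyphenated charset names that iconv expects (the spaced labels in the headers above may not have been recognized by the browser at all).

```php
<?php
// Decode one raw byte under each candidate single-byte charset.
// Returns a map of charset name => decoded UTF-8 string.
function decode_byte(int $byte, array $charsets): array
{
    $out = [];
    foreach ($charsets as $charset) {
        $out[$charset] = iconv($charset, 'UTF-8', chr($byte));
    }
    return $out;
}

$decoded = decode_byte(0xBE, ['ISO-8859-1', 'ISO-8859-2', 'WINDOWS-1250']);
foreach ($decoded as $charset => $char) {
    printf("%-13s 0xBE -> %s\n", $charset, $char);
}
// ISO-8859-1 yields ¾, ISO-8859-2 yields ž, WINDOWS-1250 yields ľ
```

Only Windows-1250 turns 0xBE into the ľ needed for "levanduľa", which matches the charset the browser reported in the first answer.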


It works by adding a header WITHOUT charset=utf-8.

These two headers work

header('Content-Type: text/plain');
header('Content-Type: text/html');

These two headers do NOT work

header('Content-Type: text/plain; charset=utf-8');
header('Content-Type: text/html; charset=utf-8');
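
A likely reason, stated as an assumption: the bytes being sent are Windows-1250, so declaring charset=utf-8 mislabels them, while a header without a charset leaves the browser free to guess. The sketch below keeps the declared charset and the body in agreement; build_response is a hypothetical helper and the sample bytes stand in for the remote file.

```php
<?php
// Hypothetical helper: return [Content-Type header line, body] so that the
// declared charset always matches the bytes actually sent.
function build_response(string $bytes, string $sourceCharset, bool $convertToUtf8): array
{
    if ($convertToUtf8) {
        $body = iconv($sourceCharset, 'UTF-8', $bytes);
        if ($body === false) {
            throw new RuntimeException("Cannot convert from $sourceCharset");
        }
        return ['Content-Type: text/plain; charset=utf-8', $body];
    }
    // Pass the bytes through unchanged and declare their real charset.
    return ["Content-Type: text/plain; charset=$sourceCharset", $bytes];
}

// "pižmo" as Windows-1250 bytes: ž is the single byte 0x9E.
$raw = "pi\x9Emo";

[$header, $body] = build_response($raw, 'windows-1250', true);
// In a web script this would be: header($header); echo $body;
echo $header, "\n", $body, "\n";
```

Either branch gives the browser a consistent pair; the mismatch only arises when the header says UTF-8 and the body is left unconverted.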

This code was tested and displayed all characters:

<?php
header('Content-Type: text/plain');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>

[image]

<?php
header('Content-Type: text/html');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>

[image]



These are some of the problematic characters with their Hex values.
This is the saved file viewed in Notepad++ with UTF-8 Encoding.

[image]

Check the Hex values against these character sets.

[image]

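From the table above I concluded that the character set belongs to the Central European (Latin 2) family.

I checked the Wikipedia article on Windows code pages and found that the matching code page is Windows-1250. Note that Windows-1250 is not identical to ISO-8859-2 (the standard usually called Latin-2): the two place characters such as ľ, š and ž at different byte values.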
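
The lookup against the character set tables can also be done programmatically. The following sketch lists which high bytes decode differently in ISO-8859-2 and Windows-1250; differing_bytes is a hypothetical helper, and it relies on the local iconv build supporting both charset names.

```php
<?php
// List the high bytes (0x80-0xFF) that decode differently in two charsets.
// Bytes that either charset cannot decode are skipped.
function differing_bytes(string $a, string $b): array
{
    $diff = [];
    for ($byte = 0x80; $byte <= 0xFF; $byte++) {
        // @ suppresses the notice iconv raises for undefined bytes.
        $inA = @iconv($a, 'UTF-8', chr($byte));
        $inB = @iconv($b, 'UTF-8', chr($byte));
        if ($inA === false || $inB === false) {
            continue;
        }
        if ($inA !== $inB) {
            $diff[$byte] = [$inA, $inB];
        }
    }
    return $diff;
}

$diff = differing_bytes('ISO-8859-2', 'WINDOWS-1250');
foreach ([0xBE, 0x9A, 0x9E, 0xE8, 0xF2] as $byte) {
    printf("0x%02X differs: %s\n", $byte, isset($diff[$byte]) ? 'yes' : 'no');
}
```

Bytes such as 0xE8 (č) and 0xF2 (ň) are the same in both charsets, so they cannot tell the two apart; bytes such as 0xBE, 0x9A and 0x9E can.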


bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn

Misunderstood
  • 5,534
  • 1
  • 18
  • 25
  • No, it does not, I can see "è" where it should read "č", "ò" instead of "ň", etc... – petiar Nov 19 '17 at 15:15
  • I would not have caught those. I was able to find the one character, so it has improved; I found that one character after I posted. You need to know which character encoding is being used, then add that character set to the header(). This link might help: https://docs.oracle.com/cd/B10501_01/server.920/a96529/ch2.htm – Misunderstood Nov 19 '17 at 15:22
  • I do not believe the problem is with file_get_contents(), I updated my post with more info. – Misunderstood Nov 19 '17 at 15:42
  • Thanks, nice research. It works now with the Windows encoding, which is funny because the party that sends the file keeps saying it is UTF-8 encoded, which it probably is not. Anyway, where did I say there was a problem with the file_get_contents() function? – petiar Nov 20 '17 at 02:33