0

I'm working with a PHP array which contains values parsed from a previous scraping process (using Simple HTML DOM Parser). I can print / echo the values of this array, which contain special characters (é, à, è, etc.), without trouble. The problem is the following:

When I use fwrite to save the values to a .csv file, some characters are not saved correctly. For example, Székesfehérvár displays correctly in my PHP view in HTML, but is saved as Sz&#233;kesfeh&#233;rv&#225;r in the .csv file which I generate with the PHP script below.

I've already checked or tried several things in the PHP script:

  • The page I'm scraping seems to be UTF-8 encoded
  • My PHP script is also declared as UTF-8 in the header
  • I've tried a lot of iconv and mb_encode methods in different places in the code
  • NOTE that when I console.log my PHP array in JS, using json_encode, the characters are also broken; maybe this is linked to the original encoding of the page I'm scraping?

Here's the part of the script that writes the values to a .csv file:

<?php 

$data = array(
            array("item1", "item2"), 
            array("item1", "item2"),
            array("item1", "item2"),
            array("item1", "item2")
            // ...
);

//filename
$filename = 'myFileName.csv';

//accumulates the whole content of the .csv as a string
$string_txt = "";

foreach($data as $line) {
    //starts a new line of the .csv
    $line_txt = "";
    foreach($line as $item) {
        //each line of the .csv holds the values of one php subarray, tab separated
        $line_txt .= $item . "\t";
    }

    //PHP end-of-line constant, terminates the current line of the .csv
    $line_txt .= PHP_EOL;

    //append the line to the string which is the global content of the .csv
    $string_txt .= $line_txt;
}

//writing the string in a .csv file 
$file = fopen($filename, 'w+');
fwrite($file, $string_txt);
fclose($file);

I am currently stuck because I can't save values with accented characters correctly.

Maxime
  • _“The page i'm scrapping seems to be utf-8 encoded”_ - it much rather seems, that the page you are scraping actually uses these numeric entities to represent these characters already. You probably just haven’t noticed, because you looked at your debug outputs _after_ the browser has interpreted them as HTML. `html_entity_decode` should help. – misorude Aug 26 '19 at 12:12
  • @misorude, thanks for your help. I don't really understand your comment, so let me add this: when I do a `print_r` of my `$data` array, all the characters are available, but the problem appears when I try to do something else with this array, such as a `json_encode` for JS, or writing to a .csv. Do you understand what I mean? thx – Maxime Aug 26 '19 at 12:33
  • Do a `print_r("Sz&#233;kes");` - notice something? – misorude Aug 26 '19 at 12:35
  • Yes, the `print_r` returns `Székes`. Following your advice, I used `htmlentities` to get the _original_ numeric entities of the values, but my question is now: how can I **store** the values as `Székes`, for example, and not as `Sz&#233;kes`? thx @misorude – Maxime Aug 26 '19 at 13:06
  • _“Yes, the `print_r` return `Székes`”_ - so do you understand my initial comment now then? _“How can I store the values as `Székes` for example, and not as `Sz&#233;kes` ?”_ - by making the value that you have, into the value that you want – you currently _have_ `Sz&#233;kes`. And no, I did not say to use `htmlentities`. – misorude Aug 26 '19 at 13:08
  • Sorry, but even when using `html_entity_decode`, I still can't write the right value to a file. Example: `fwrite($file, html_entity_decode("S&#243"));` writes `S&#243`, while `echo html_entity_decode("S&#243");` echoes `Só`. Did I miss something in your explanations? sorry ... – Maxime Aug 26 '19 at 13:49
  • `&#243;` would be a numeric HTML character reference - `&#243` is not. – misorude Aug 26 '19 at 13:53

4 Answers

1

Put this line in your code

header('Content-Type: text/html; charset=UTF-8');

Hope this helps you!

Rajdip Chauhan
  • Thank you, but it's already in the script; I just simplified the code for the post. (It is also mentioned in the text: "My PHP script is also declared as utf-8 in the header".) thx – Maxime Aug 26 '19 at 11:01
  • Can you please check by selecting the UTF-8 encoding type when you open the downloaded CSV file? – Rajdip Chauhan Aug 26 '19 at 11:34
1

Try it


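$file = fopen('myFileName.csv', 'w');
foreach ($data as $row) {
    // $data is an array of rows: convert each value of a row, then write that row
    fputcsv($file, array_map("utf8_decode", $row));
}
fclose($file);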

dılo sürücü
0

Excel has problems displaying UTF-8 encoded CSV files; I have seen this before. You can try adding a UTF-8 BOM, which worked for me. This simply means adding these three bytes once, at the start of your UTF-8 string:

$string_txt = chr(239) . chr(187) . chr(191) . $string_txt;

For more info: Encoding a string as UTF-8 with BOM in PHP

Alternatively, you can use the file import feature in Excel and make sure the file origin says 65001 : Unicode(UTF8). It should display your text properly and you will need to save it as an Excel file to preserve the format.
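As a rough sketch of the BOM approach, the script below writes the three BOM bytes once at the very start of the file and then the tab-separated rows, as in the question. The sample rows are hypothetical placeholders for the scraped `$data` array, and the sketch does not verify how any particular Excel version then handles the tab-separated columns.

```php
<?php
// Hypothetical sample rows standing in for the scraped $data array.
$data = array(
    array("Székesfehérvár", "Sóstói Stadion"),
    array("item1", "item2"),
);

$file = fopen('myFileName.csv', 'w');

// Write the UTF-8 byte order mark (bytes EF BB BF) once, before any data.
fwrite($file, "\xEF\xBB\xBF");

foreach ($data as $row) {
    // One line per row, values separated by tabs.
    fwrite($file, implode("\t", $row) . PHP_EOL);
}

fclose($file);
```

The BOM belongs at the start of the file only; repeating it before each value would put stray bytes inside the cells.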

jasonwubz
0

The solution (provided by @misorude) :

When scraping HTML content from web pages, there is a difference between what is displayed in your debug output and what the script has really scraped. I had to use html_entity_decode so that PHP works with the true value of the HTML I scraped, and not with the browser's interpretation of it.

To check that the values were retrieved correctly before storing them somewhere, you can try a console.log in JS to see whether they come through correctly:

PHP

//decoding the numeric HTML entities that represent "Sóstói Stadion"
$b = html_entity_decode("S&#243;st&#243;i Stadion"); 

Javascript (to test):

<script>
var b = <?php echo json_encode($b) ;?>;

//prints "Sóstói Stadion" correctly
console.log(b); 
</script>
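
Applied to the .csv export from the question, the same idea could look like the sketch below. The sample rows and the `decode_row` helper are hypothetical names introduced only for illustration, and the sketch assumes the scraped values contain well-formed numeric entities (terminated by a semicolon) and that UTF-8 is the wanted output encoding.

```php
<?php
// Hypothetical sample rows; in the real script $data comes from the scraper.
$data = array(
    array("Sz&#233;kesfeh&#233;rv&#225;r", "S&#243;st&#243;i Stadion"),
    array("item1", "item2"),
);

// Hypothetical helper: decode the HTML entities in every cell of one row.
function decode_row(array $row): array
{
    return array_map(function ($cell) {
        // Turn entities such as &#233; into real UTF-8 characters.
        return html_entity_decode($cell, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }, $row);
}

$file = fopen('myFileName.csv', 'w');

foreach ($data as $row) {
    // Decode first, then write the row tab-separated, as in the question.
    fwrite($file, implode("\t", decode_row($row)) . PHP_EOL);
}

fclose($file);
```

Decoding before writing means the file holds the characters themselves rather than the entity text, so it no longer depends on a browser to render them.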
Maxime