
I am importing the contents of an Excel-generated CSV file into an XML document like this:

$csv = fopen($csvfile, 'r'); // the mode must be a quoted string
$words = array();

while (($pair = fgetcsv($csv)) !== FALSE) {
    array_push($words, array('en' => $pair[0], 'de' => $pair[1]));
}

The inserted data are English/German expressions.

I insert these values into an XML structure and output the XML as follows:

$dictionary = new SimpleXMLElement('<dictionary></dictionary>');
//do things
$dom = dom_import_simplexml($dictionary)->ownerDocument;
$dom->formatOutput = true;

// Declare the XML MIME type and the UTF-8 charset in one header;
// Content-Encoding is reserved for compression schemes such as gzip.
header('Content-Type: text/xml; charset=utf-8');

echo $dom->saveXML();

This is working fine, yet I am encountering one really strange problem. When the first letter of a String is an Umlaut (like in Österreich or Ägypten) the character will be omitted, resulting in gypten or sterreich. If the Umlaut is in the middle of the String (Russische Föderation) it gets transferred correctly. Same goes for things like ß or é or whatever.

All files are UTF-8 encoded and served in UTF-8.

This seems rather strange and bug-like to me, yet maybe I am missing something; there are a lot of smart people around here.

m90
  • Is ucfirst() being used at all? I seem to recall issues with characters with diacritic marks, and having to use mb_convert_case() on the first character instead. – jornak Sep 12 '12 at 14:56
  • @jornak The values are "correctly capitalized" in the CSV file so I didn't think I'll have to mess with that, but I'll give it a try. – m90 Sep 12 '12 at 14:59
  • Why don't you start those words with a prefix? Then remove it after adding those words. – Luigi Siri Sep 12 '12 at 15:22
  • @m90 check this out, http://stackoverflow.com/questions/1571626/simplyxml-and-accented-characters-in-php – FirmView Sep 13 '12 at 01:57

5 Answers


Ok, so this seems to be a bug in fgetcsv.

I am now processing the CSV data on my own (a little cumbersome), but it is working and I do not have any encoding issues at all.

This is (a not-yet-optimized version of) what I am doing:

$rawCSV = file_get_contents($csvfile);

$lines = preg_split ('/$\R?^/m', $rawCSV); //split on line breaks in all operating systems: http://stackoverflow.com/a/7498886/797194

foreach ($lines as $line) {
    array_push($words, getCSVValues($line));
}

The getCSVValues function comes from here and is needed to deal with CSV lines that contain commas inside quoted fields, like this:

"I'm a string, what should I do when I need commas?",Howdy there

It looks like:

function getCSVValues($string, $separator=","){

    $elements = explode($separator, $string);

    for ($i = 0; $i < count($elements); $i++) {
        $nquotes = substr_count($elements[$i], '"');
        if ($nquotes %2 == 1) {
            for ($j = $i+1; $j < count($elements); $j++) {
                if (substr_count($elements[$j], '"') %2 == 1) { // Look for an odd-number of quotes
                    // Put the quoted string's pieces back together again
                    array_splice($elements, $i, $j-$i+1,
                        implode($separator, array_slice($elements, $i, $j-$i+1)));
                    break;
                }
            }
        }
        if ($nquotes > 0) {
            // Remove first and last quotes, then merge pairs of quotes
            $qstr =& $elements[$i];
            $qstr = substr_replace($qstr, '', strpos($qstr, '"'), 1);
            $qstr = substr_replace($qstr, '', strrpos($qstr, '"'), 1);
            $qstr = str_replace('""', '"', $qstr);
        }
    }
    return $elements;

}

Quite a bit of a workaround, but it seems to work fine.

EDIT:

There is also a filed bug for this; apparently the behavior depends on the locale settings.
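Since the behavior appears to depend on the locale, a minimal sketch of keeping fgetcsv and setting a UTF-8 locale first might look like the following. The function name readWordPairs, the default locale de_DE.UTF-8 and the fallback locale names are assumptions for illustration; which locales are installed varies by host, and if none of them is available the sketch simply parses under the existing locale.

```php
<?php
// Sketch: parse a two-column CSV with fgetcsv() under an explicit UTF-8 locale.
// readWordPairs() and the locale names are illustrative assumptions.
function readWordPairs(string $csvFile, string $locale = 'de_DE.UTF-8'): array
{
    $handle = fopen($csvFile, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $csvFile");
    }

    // Remember the current LC_CTYPE so it can be restored afterwards.
    $previous = setlocale(LC_CTYPE, '0');

    // setlocale() tries each candidate in turn and returns false if none
    // is installed; in that case the existing locale stays in effect.
    setlocale(LC_CTYPE, [$locale, 'C.UTF-8', 'en_US.UTF-8']);

    $words = [];
    try {
        while (($pair = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            if (count($pair) < 2) {
                continue; // blank lines come back as a one-element array
            }
            $words[] = ['en' => $pair[0], 'de' => $pair[1]];
        }
    } finally {
        fclose($handle);
        if ($previous !== false) {
            setlocale(LC_CTYPE, $previous);
        }
    }

    return $words;
}
```

Only LC_CTYPE is changed here rather than LC_ALL, on the assumption that character classification is the part that matters; this keeps number and date formatting untouched.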

m90
  • if it is a bug in a PHP function, you should probably report it on php.net. However, this may also help you: http://static.zend.com/topics/multibyte-fgetcsv.pdf – SDC Sep 13 '12 at 08:57
  • I can confirm it probably is a bug in PHP as I tried any sensible method of debugging: http://stackoverflow.com/questions/16653369/str-getcsv-is-broken-clips-diacritics – Josef Sábl May 20 '13 at 16:00
  • @JosefSábl I have encountered this quite a few times by now and found that setting the correct locale via `setlocale(LC_ALL,'de_DE.UTF-8')` (German in my case) seems to "patch" it quite well. – m90 May 20 '13 at 16:08

If the string comes from Excel (I had problems with the letter ø disappearing if it was in the beginning of the string) ... then this fixed it:

setlocale(LC_ALL, 'en_US.ISO-8859-1');

Brian Langhoff
  • It all seems to be a locale issue: http://static.zend.com/topics/multibyte-fgetcsv.pdf – m90 Jun 28 '13 at 10:01
  • Had same issue when importing from .csv file which is generated from .xlsx. setlocale(LC_ALL, 'en_US.ISO-8859-1'); fixed it! – darjus Jul 04 '17 at 05:13

If other umlauts in the middle appear ok, then this is not a base encoding issue. The fact that it happens at the beginning of the line probably indicates some incompatibility with the newline mark. Perhaps the CSV was generated with a different newline encoding.

This happens when moving files between different OS:

  • Windows: \r\n (characters 13 and 10)
  • Linux and modern macOS: \n (character 10)
  • Classic Mac OS (before OS X): \r (character 13)

If I were you, I would verify the newline mark to be sure.

On Linux, run hexdump -C filename | more and inspect the output.

You can change the newline marks with a sed expression if that's the case.
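The same check can be done from PHP itself. The sketch below counts each newline style in the raw file contents and normalizes everything to \n; the function names countLineEndings and normalizeLineEndings are made up for this example.

```php
<?php
// Sketch: report which newline styles a string contains.
// countLineEndings() and normalizeLineEndings() are hypothetical helpers.
function countLineEndings(string $raw): array
{
    $crlf = substr_count($raw, "\r\n");

    return [
        'crlf' => $crlf,
        // Subtract the CRLF pairs so their halves are not counted twice.
        'lf'   => substr_count($raw, "\n") - $crlf,
        'cr'   => substr_count($raw, "\r") - $crlf,
    ];
}

// Convert Windows (\r\n) and lone \r line endings to \n.
function normalizeLineEndings(string $raw): string
{
    // The array is processed in order, so \r\n is replaced before lone \r.
    return str_replace(["\r\n", "\r"], "\n", $raw);
}
```

Usage would be along the lines of countLineEndings(file_get_contents($csvfile)); more than one non-zero entry suggests the file has mixed line endings.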

Hope that helped!

felixgaal
  • This sounds pretty reasonable, but unfortunately the German (umlaut-hell) expressions are not at the beginning of a new line, but in the middle. I just identified `fgetcsv` as the culprit, as the content gets transferred just fine when you use something like `file_get_contents`. – m90 Sep 13 '12 at 07:31
  • Ok, so now I am manually processing the file's contents and it is working just fine. I'll write an answer myself. Thanks for your input!! – m90 Sep 13 '12 at 08:13

A bit simpler workaround (but pretty dirty):

//1. replace delimiter in input string with delimiter + some constant
$dataLine = str_replace($this->fieldDelimiter, $this->fieldDelimiter . $this->bugFixer, $dataLine);

//2. parse
$parsedLine = str_getcsv($dataLine, $this->fieldDelimiter);

//3. remove the constant from resulting strings.
foreach ($parsedLine as $i => $parsedField)
{
    $parsedLine[$i] = str_replace($this->bugFixer, '', $parsedField);
}
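For use outside a class, the three steps above could be wrapped in a standalone function roughly like this. The name parseGuardedLine and the choice of the ASCII unit separator ("\x1F") as the guard constant are assumptions, not part of the original snippet. Unlike the snippet, this sketch also guards the first field. It assumes unquoted fields: with a guard in front, a leading quote is no longer at the start of the field, so quoted values are unlikely to be parsed as such.

```php
<?php
// Sketch of the prefix workaround as a standalone function.
// parseGuardedLine() and the "\x1F" guard byte are illustrative assumptions;
// the guard must never occur in the real data.
function parseGuardedLine(string $line, string $delimiter = ',', string $guard = "\x1F"): array
{
    // 1. Put the guard in front of every field, including the first one.
    $guarded = $guard . str_replace($delimiter, $delimiter . $guard, $line);

    // 2. Parse; no field starts with a multibyte character any more.
    $fields = str_getcsv($guarded, $delimiter, '"', '\\');

    // 3. Strip the guard from the parsed values again.
    return array_map(
        function ($field) use ($guard) {
            return str_replace($guard, '', (string) $field);
        },
        $fields
    );
}
```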
Josef Sábl

Could be some sort of utf8_encode() problem. This comment on the documentation page seems to indicate if you encode an Umlaut when it's already encoded, it could cause issues.

Maybe test to see if the data is already utf-8 encoded with mb_detect_encoding().
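A stricter alternative to guessing with mb_detect_encoding() is to validate each value with mb_check_encoding(), which only answers whether the bytes are well-formed UTF-8. The sketch below assumes the mbstring extension is loaded; findInvalidUtf8Fields is a hypothetical helper name.

```php
<?php
// Sketch: list the indexes of fields in a parsed CSV row that are not
// well-formed UTF-8. Requires the mbstring extension.
// findInvalidUtf8Fields() is a hypothetical helper.
function findInvalidUtf8Fields(array $row): array
{
    $invalid = [];
    foreach ($row as $index => $value) {
        if (!mb_check_encoding((string) $value, 'UTF-8')) {
            $invalid[] = $index;
        }
    }

    return $invalid;
}
```

An empty result means every field is valid UTF-8. It does not rule out double encoding, since double-encoded text is still well-formed UTF-8.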

taco
  • I never do/did any encoding on the values. The problem was a bug in `fgetcsv`, see my answer for a workaround. – m90 Sep 13 '12 at 08:23