I have a .tsv file that uses Danish letters such as Æ Ø Å. The file is read in PHP with file_get_contents(), then processed and turned into a mysqli query.

I tried putting <?php header('Content-Type: text/html; charset=utf-8'); ?> at the very top of the code, and also using the meta tag <meta charset="UTF-8">

and in my SQL I have the columns created like:

text COLLATE utf8_danish_ci NOT NULL

and:

PRIMARY KEY (`id`)\n) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci AUTO_INCREMENT

and:

$conn->set_charset("utf8");

.... But still no luck.

If I open my .tsv file in Excel, it shows the Æ Ø Å correctly. But when I open it with "TextEdit" on a Mac, the "Æ Ø Å" shows up as "¯ ¯ ¯"

UPDATE - SOLUTION: as the accepted answer suggests, I should be using CP1252:

mb_convert_encoding($fileEndEnd, 'HTML-ENTITIES', "CP1252");
Jonas Borneland
  • Have you tried this: `mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));` where `$content` is the file_get_contents output? – zebnat Oct 01 '18 at 19:42
  • Try CHARSET=utf8mb4 in the database. (MySQL's utf8 isn't a full UTF-8 charset. I don't know if the Danish letters are included.) – Jeff Oct 01 '18 at 19:45
  • The text editor you are using to create `.tsv` -- does this have UTF-8 enabled? Software like Notepad++ or by M$ is notoriously difficult in this area. – HoldOffHunger Oct 01 '18 at 19:46
  • Possible duplicate of [UTF-8 all the way through](https://stackoverflow.com/questions/279170/utf-8-all-the-way-through) –  Oct 01 '18 at 19:47
  • @zebnat Not sure if I'm doing it right... see my update in the question – Jonas Borneland Oct 01 '18 at 19:49
  • better follow @IdontDownVote link. ^^ – zebnat Oct 01 '18 at 19:50
  • @MagnusEriksson I can't change the original file, as this is meant to be a converter of files from another system. And I also don't know if the original is saved as utf8 - but I guess so, as Excel reads it fine – Jonas Borneland Oct 01 '18 at 19:51

1 Answer

There are many things to consider with UTF-8. But I see this one particular comment of yours...

If I open my .tsv file in Excel, it shows the Æ Ø Å correctly. But when I open it with "TextEdit" on a Mac, the "Æ Ø Å" shows up as "¯ ¯ ¯"

The problem...

If you are talking about Microsoft Excel, then you should know that the characters above can be stored either as UTF-8 or in the single-byte Windows code page CP1252 (Windows-1252), which covers the letters of Unicode's Latin-1 Supplement block. Take a look: LATIN_1_SUPPLEMENT Block

If you save this document without explicitly setting its encoding to UTF-8, then Windows has no reason to convert the text from CP1252 to UTF-8. But that conversion is what you will need to do.

Possible solutions...

On your server: You can try converting any Windows charset or "unknown" charset from CP1252 to UTF-8. (Since Windows saves documents "according to the system default", the encoding information may be gone by the time the file hits your Linux servers.)
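A minimal sketch of that server-side conversion, assuming the mbstring extension is loaded. The helper name `toUtf8` and the choice of Windows-1252 as the fallback are illustrative assumptions, not something taken from the question's code:

```php
<?php
// Hypothetical helper (name and default fallback are assumptions for illustration).
function toUtf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    // If the bytes are already well-formed UTF-8, leave them untouched.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Otherwise assume the single-byte fallback encoding and convert to UTF-8.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// "\xC6\xD8\xC5" is "ÆØÅ" as CP1252 bytes (one byte per letter).
$converted = toUtf8("\xC6\xD8\xC5");
```

Checking for valid UTF-8 first avoids double-encoding files that were already correct. The fallback is still a guess: a single-byte file in some other code page would convert without an error but yield the wrong letters.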

On the submitter's computer: You can solve this by having the user change the encoding settings in whatever editor is generating the document, so that it saves documents as UTF-8. Many editors can then write a BOM, or "byte-order mark", at the start of the file, which your server can read. This second approach may seem user-unfriendly (and it is, sure), but it can help you identify where the data is being corrupted.
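To see whether an uploaded file carries a UTF-8 BOM, the server can compare its first three bytes against EF BB BF. This is only a sketch; the helper name `stripUtf8Bom` is made up for illustration, and the sample string stands in for real `file_get_contents()` output:

```php
<?php
// Hypothetical helper: reports whether a UTF-8 BOM was present and returns the text without it.
function stripUtf8Bom(string $bytes): array
{
    $bom = "\xEF\xBB\xBF"; // UTF-8 encoding of U+FEFF
    if (strncmp($bytes, $bom, 3) === 0) {
        return [true, substr($bytes, 3)];
    }
    return [false, $bytes];
}

// Sample header row with a leading BOM, standing in for an uploaded .tsv file.
[$hadBom, $body] = stripUtf8Bom("\xEF\xBB\xBFid\tnavn\n");
```

A missing BOM does not prove the file is not UTF-8, since many tools write UTF-8 without one, so this check works best alongside a validity test such as `mb_check_encoding()`.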

HoldOffHunger
  • Thanks!! This sounds promising... Can I use `mb_convert_encoding` for that and do the conversion to UTF-8 there? – Jonas Borneland Oct 01 '18 at 19:58
  • The problem partially has to do with what the user is uploading. Excel may save the file in the "local system" charset (a Windows system), but you have a Linux/Apache system reading it. I would try `mb_convert_encoding()`, if and only if `mb_detect_encoding()` reports CP1252, 'iso-8859-1', or (most likely, as I have seen) "unknown". – HoldOffHunger Oct 01 '18 at 20:02
  • I've been searching and trying to do it... Is this the correct way to use mb_convert_encoding? `$toconvert = file_get_contents($inputFile); mb_convert_encoding($toconvert, 'CP1252', 'UTF-8'); $this->tsv->content = $toconvert;` – Jonas Borneland Oct 01 '18 at 20:22
  • @JonasB : No worries, I think you have your to/from charsets in the wrong order. ('CP1252' should be the "from" charset, 'UTF8' should be the "to" charset). – HoldOffHunger Oct 01 '18 at 20:23
  • Thanks!!... Hmm, no luck yet. I tried both that and `mb_convert_encoding($toconvert, 'UTF-8', 'unknown');` – Jonas Borneland Oct 01 '18 at 20:25
  • @JonasB: You should try "CP1252" where you have "unknown". Without charset information, Linux will treat it as unknown, but since you know it's made with Excel, you should try "CP1252" (or "ISO-...", etc.). – HoldOffHunger Oct 01 '18 at 20:37
  • Thanks! It's NOT made with Excel though. I just tried to open it in Excel to see if the letters would show up... It is a file downloaded from a website, so it might even be generated from PHP :) – Jonas Borneland Oct 01 '18 at 20:44
  • OHH yes!! found the solution. Your answer, mixed with https://stackoverflow.com/questions/2236668/file-get-contents-breaks-up-utf-8-characters .... The final solution was `mb_convert_encoding($fileEndEnd, 'HTML-ENTITIES', "CP1252");` – Jonas Borneland Oct 01 '18 at 21:28
  • @JonasB: Nice. =) Rock on. – HoldOffHunger Oct 01 '18 at 21:29
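
The argument order discussed in this thread can be sketched as follows. `mb_convert_encoding()` takes the string, then the target encoding, then the source encoding, and it returns the converted string rather than changing its argument, so the result has to be assigned. The sample bytes below are an assumption standing in for the real `file_get_contents($inputFile)` output:

```php
<?php
// Stand-in for file_get_contents($inputFile): "ÆØÅ" as CP1252 bytes.
$raw = "\xC6\xD8\xC5";

// Signature order is (string, to_encoding, from_encoding).
// The converted text is the return value; $raw itself is left unchanged.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'CP1252');

// $utf8 now holds the UTF-8 bytes for "ÆØÅ" and is what should be stored or queried.
```

Converting to 'UTF-8' rather than 'HTML-ENTITIES' keeps the data as real text for the database. As far as I know, the 'HTML-ENTITIES' target in mbstring is deprecated as of PHP 8.2, so the UTF-8 form may also be the more future-proof choice.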