I have an Excel sheet that I saved as CSV, encoded as UTF-8 via the web options. When I first did this, the file showed a lot of accented letters, which is odd because the content is a really simple list of languages. I deleted all of these characters and imported the CSV into SQL using LOAD DATA with the utf8 character set. I've run the query 'SET NAMES utf8' and changed the collation of my database to utf8_general_ci in the Operations menu of phpMyAdmin. I've set

<meta charset="utf-8"/>

at the top of my webpage and I've set

<?php echo utf8_encode($row['lang']); ?>

in my webpage to echo a string of languages.

What I get is a list that overflows the div container. I've found that some spaces allow a line break and some don't, so the browser seems to treat the list as one long word rather than several separate ones. Does anyone have any ideas?

Example data: "English, Spanish, French, Russian"

t1gor
Alec Davies
  • Sorry, but your “example data” doesn’t help one bit, and your problem description is rather unclear as well. Suggest you first of all open your CSV in a HEX editor, to see what byte values you are actually dealing with in the places in question. – CBroe Aug 30 '17 at 11:43
  • Also, MySQL's "utf8" is not really UTF-8 but a version limited to 3 bytes per character. Charsets are really annoying: setting the meta tag to utf-8 might not be enough, and setting the charset in the HTTP headers might help. Also, if you read the data as UTF-8 from the database, you should not need to utf8_encode it again. Something's fishy; find out where! – Jakumi Aug 30 '17 at 12:47
  • @Jakumi Interestingly, when I remove the utf8_encode() function I get the spaces replaced with the diamond with a question mark. The space issue still persists. Sounds very similar to this post: https://stackoverflow.com/questions/7262687/weird-white-space-characters-utf8-php but I'm not technical enough to understand how to solve it – Alec Davies Aug 30 '17 at 13:08
  • Removing the accented letters (probably Mojibake) has destroyed the content. The problem was probably that some part of the process failed to specify utf8. – Rick James Aug 30 '17 at 17:27

1 Answer

Currently, your problem description is insufficient to reproduce the problem.

However, I'll suggest a common recipe that works most of the time to pinpoint the problem:

0. Find an example in the output where the encoding is weird/strange/wrong

Use this example in the following steps to determine where the problem is located.

1. Is the data in the file correct?

Open the text file and see whether there are any odd characters. If there are, that could be your problem. Fix it.
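Odd characters are not always visible in an editor, so it can help to look at the raw bytes instead. The following is only a minimal sketch: `findNonAsciiBytes` is a hypothetical helper name, and the sample string is an assumption based on the example data in the question, with a UTF-8 non-breaking space (bytes C2 A0) where a normal space would be expected.

```php
<?php
// Hypothetical helper: report every non-ASCII byte in a string with its offset.
function findNonAsciiBytes(string $line): array
{
    $hits = [];
    $length = strlen($line); // strlen() counts bytes, not characters
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($line[$i]);
        if ($byte > 0x7F) {
            $hits[] = sprintf('offset %d: 0x%02X', $i, $byte);
        }
    }
    return $hits;
}

// Assumed sample: "English," followed by a UTF-8 non-breaking space (C2 A0).
$sample = "English,\xC2\xA0Spanish";
print_r(findNonAsciiBytes($sample));
```

For the assumed sample this reports 0xC2 at offset 8 and 0xA0 at offset 9. A file that is meant to be plain ASCII should produce an empty list for every line.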

2. Is the data in the database correct?

Open your database with phpMyAdmin or some other tool that doesn't tamper with the encoding and display something misleading. Check whether there are odd characters. If there are, either your import or your table's/database's charset/collation is wrong. Fix it.
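To avoid relying on how a tool renders the text, one option is to look at the stored bytes with MySQL's `HEX()` function, for example `SELECT lang, HEX(lang)` against your table (`lang` is the column from the question). The sketch below interprets such a hex string. `describeStoredBytes` is a hypothetical helper, and it assumes the column itself is declared with a utf8 character set; for a latin1 column the same bytes would mean something else.

```php
<?php
// Hypothetical helper: interpret the output of MySQL's HEX() for one value.
// Assumes the column itself is declared with a utf8 character set.
function describeStoredBytes(string $hex): string
{
    // Accept only complete byte pairs so hex2bin() cannot fail.
    if (preg_match('/\A(?:[0-9A-Fa-f]{2})*\z/', $hex) !== 1) {
        return 'not a hex string';
    }
    $bytes = hex2bin($hex);

    // C3 82 C2 A0 is "Â" followed by a non-breaking space: typical Mojibake.
    if (strpos($bytes, "\xC3\x82\xC2\xA0") !== false) {
        return 'double-encoded non-breaking space';
    }
    // C2 A0 alone is a correctly stored non-breaking space.
    if (strpos($bytes, "\xC2\xA0") !== false) {
        return 'UTF-8 non-breaking space';
    }
    if (preg_match('/[\x80-\xFF]/', $bytes) === 1) {
        return 'other non-ASCII bytes';
    }
    return 'ASCII only';
}

// Assumed sample: HEX() of "English," + non-breaking space + "Spanish".
echo describeStoredBytes('456E676C6973682CC2A05370616E697368'), "\n";
```

A simple list of language names would normally come back as "ASCII only"; anything else points at the import or at the file itself.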

3. Is the charset in the HTTP header correct?

Use your browser's developer tools to find the response's charset. Also use the View -> Text Encoding menu options (this is in Firefox; other browsers are similar) to see whether something changes when you switch to UTF-8. If it does, this usually implies that the encoding declared in the header doesn't match the encoding of the payload. Setting the charset in both the HTTP header and the meta tag (which is usually ignored when the header specifies a charset) might help.
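In PHP the header can be set from the script itself. This is a minimal sketch, assuming the page is HTML and that nothing has been output yet; `contentTypeHeader` is a hypothetical helper that only exists so the header line can be inspected or logged.

```php
<?php
// Hypothetical helper that builds the header line.
function contentTypeHeader(string $charset = 'utf-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before any output has been sent.
if (!headers_sent()) {
    header(contentTypeHeader());
}
```

The call has to come before any HTML, including whitespace before the opening `<?php` tag, otherwise the header can no longer be changed.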

4. Ensure consistent data encoding

Both the PHP script file and the data your DB query returns should have the same encoding (preferably UTF-8), especially if a literal string in your PHP file is output or if some other file with a different encoding is output. This is not always simple to determine, but some iconv shenanigans may shed light on it, and file -i will help determine the encoding of the script file.
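A quick way to test whether a given string is valid UTF-8 at all, without extra extensions, is to run it through a regular expression with the `u` modifier. This is only a sketch of that idea; `isValidUtf8` is a hypothetical helper name and the sample strings are assumptions.

```php
<?php
// Hypothetical helper: true if the string is well-formed UTF-8.
function isValidUtf8(string $value): bool
{
    // With the /u modifier, PCRE rejects subjects that are not valid UTF-8,
    // so preg_match() returns false instead of 1.
    return preg_match('//u', $value) === 1;
}

var_dump(isValidUtf8("English, Spanish"));         // plain ASCII
var_dump(isValidUtf8("English,\xC2\xA0Spanish"));  // UTF-8 non-breaking space
var_dump(isValidUtf8("English,\xA0Spanish"));      // bare Latin-1 byte 0xA0
```

The first two calls print `bool(true)` and the last prints `bool(false)`. The same check can be applied to a value from the database or to the contents of the script file. Note that it only says whether the bytes are well-formed UTF-8, not whether they are the characters you intended.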

If all of this is in order, I would be out of ideas. If you have problems understanding some or all of these parts ... you should read up or ask someone else to do this.

Jakumi
  • OK thanks a lot for the response. So I've discovered that my file is in UTF-8 and when I switch it to ANSI I get these little 'Â' buggers coming up. I'm pretty certain these are what is messing up my text. When I find and replace them all with a null value and switch back to UTF-8, they are now little 'xA0's. When I'm importing to SQL, I'm importing as UTF-8 and I have my collations as UTF-8 but I can see in the database that the 'Â's are still there, meaning it's importing as ANSI, I guess. What's the best way to get rid of them? – Alec Davies Aug 30 '17 at 14:22
  • using a sensible editor, i hope? I believe there are bash/awk/whatever scripts to remove non-ascii chars, if your file is intended to be ascii-only anyway. – Jakumi Aug 30 '17 at 15:50
  • Notepad++, OK thanks I'll look into that and let you know how it goes – Alec Davies Aug 30 '17 at 16:29
  • GOTIT! https://stackoverflow.com/questions/8781911/remove-non-ascii-characters-from-string-in-php Somehow the whitespace was a special character and the short preg_replace formula removes it effectively and replaces it with whatever you want, in my case a space! Really happy thanks for pointing me in the right direction, despite my crappy explanation! – Alec Davies Aug 30 '17 at 16:35
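For later readers, the kind of replacement described in the last comment might look like the sketch below. It is not necessarily the exact pattern from the linked question; both function names are hypothetical, and the assumption is that the odd whitespace is a non-breaking space (U+00A0, bytes C2 A0 in UTF-8), which would explain why the browser refused to wrap at those positions.

```php
<?php
// Targeted: swap only U+00A0 (non-breaking space) for a regular space.
// Requires valid UTF-8 input; preg_replace() returns null otherwise.
function replaceNbsp(string $text): ?string
{
    return preg_replace('/\x{00A0}/u', ' ', $text);
}

// Blunt: collapse every run of bytes outside printable ASCII into one space.
// This also destroys legitimate accented letters, tabs, and newlines.
function stripNonAscii(string $text): string
{
    return preg_replace('/[^\x20-\x7E]+/', ' ', $text) ?? $text;
}

// Assumed sample with a UTF-8 non-breaking space after the comma.
$sample = "English,\xC2\xA0Spanish";
echo replaceNbsp($sample), "\n";
echo stripNonAscii($sample), "\n";
```

Both calls print "English, Spanish" with an ordinary space. On input that is not valid UTF-8, such as a bare 0xA0 byte, `replaceNbsp` returns null, while `stripNonAscii` still works because it matches byte by byte. The blunt version is only reasonable when the data is meant to be ASCII-only, as in this language list; cleaning the CSV before import would avoid the need for either.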