10

Answers provided have all been great, I mentioned in the comments of Alnitak's answer that I would need to go take a look at my CSV Generation script because for whatever reason it wasn't outputting UTF-8.

As was correctly pointed out, it WAS outputting UTF-8 - the problem existed with Ye Olde Microsoft Excel which wasn't picking up the encoding the way I would have liked.

My existing CSV generation looked something like:

// Create file and exit;
$filename = $file."_".date("Y-m-d_H-i",time());
header("Content-type: application/vnd.ms-excel");
header("Content-disposition: csv" . date("Y-m-d") . ".csv");
header( "Content-disposition: filename=".$filename.".csv");
echo $csv_output;

It now looks like:

// Create file and exit;
$filename = $file."_".date("Y-m-d_H-i",time());
header("Content-type: text/csv; charset=ISO-8859-1");
header("Content-disposition: csv" . date("Y-m-d") . ".csv");
header("Content-disposition: filename=".$filename.".csv");

echo iconv('UTF-8', 'ISO-8859-1', $csv_output);

-------------------------------------------------------

ORIGINAL QUESTION

Hi,

I've got a form which collects data, form works ok but I've just noticed that if someone types or uses a '£' symbol, the MySQL DB ends up with '£'.

Not really sure where or how to stop this from happening, code and DB information to follow:

MySQL details

mysql> SHOW COLUMNS FROM fraud_report;
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | mediumint(9) |      | PRI | NULL    | auto_increment |
| crm_number   | varchar(32)  | YES  |     | NULL    |                |
| datacash_ref | varchar(32)  | YES  |     | NULL    |                |
| amount       | varchar(32)  | YES  |     | NULL    |                |
| sales_date   | varchar(32)  | YES  |     | NULL    |                |
| domain       | varchar(32)  | YES  |     | NULL    |                |
| date_added   | datetime     | YES  |     | NULL    |                |
| agent_added  | varchar(32)  | YES  |     | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+
8 rows in set (0.03 sec)

PHP Function

function    processFraudForm($crm_number, $datacash_ref, $amount, $sales_date, $domain, $agent_added) {

    // Insert Data to DB
    $sql    = "INSERT INTO fraud_report (id, crm_number, datacash_ref, amount, sales_date, domain, date_added, agent_added) VALUES (NULL, '$crm_number', '$datacash_ref', '$amount', '$sales_date', '$domain', NOW(), '$agent_added')";
    $result = mysql_query($sql) or die (mysql_error());

    if ($result) {
        $outcome = "<div id=\"success\">Emails sent and database updated.</div>";
    } else {
        $outcome = "<div id=\"error\">Something went wrong!</div>";
    }

    return $outcome;
}

Example DB Entry

+----+------------+--------------+---------+------------+--------------------+---------------------+------------------+
| id | crm_number | datacash_ref | amount  | sales_date | domain             | date_added          | agent_added      |
+----+------------+--------------+---------+------------+--------------------+---------------------+------------------+
| 13 | 100xxxxxxx | 10000000     | £10.93 | 18/12/08   |  blargh.com        | 2008-12-22 10:53:53 | agent.name | 
suitedupgeek
  • 875
  • 3
  • 10
  • 19
  • 3
    I guess you've learnt something here. But, there is a very good beginners article at http://www.joelonsoftware.com/articles/Unicode.html - 'The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)' -- It gives context to how and why character encoding is the way it is and how it works. – DMin Aug 05 '10 at 16:11
  • 1
    I see you've already read that. But, I'll let the comment remain. Its a good resource to understand what character encoding is about. – DMin Aug 05 '10 at 16:21

7 Answers7

18

What you're seeing is UTF-8 encoding - it's a way of storing Unicode characters in a relatively compact format.

The pound symbol has value 0x00a3 in Unicode, but when it's written in UTF-8 that becomes 0xc2 0xa3 and that's what's stored in the database. It seems that your database table is already set to use UTF-8 encoding. This is a good thing!

If you pull the value back out from the database and display it on a UTF-8 compatible terminal (or on a web page that's declared as being UTF-8 encoded) it will look like a normal pound sign again.

Alnitak
  • 334,560
  • 70
  • 407
  • 495
  • Thanks, learn something new everyday, now to figure out how to make my CSV generation script output UTF-8 :) – suitedupgeek Dec 22 '08 at 14:59
  • if you're outputting CSV and it contains the C2A3 sequence, it already *is* UTF-8 - whatever reads the CSV needs to understand it or be told to expect UTF-8. – Paul Dixon Dec 22 '08 at 15:03
  • The pound sign is 0xA3 in an Ansi code page and 0x00A3 in Unicode, which is why the character appeared mostly normal in the DB. But in general, the UTF-8 version of non-ASCII characters doesn't work out so nicely. – Rob Kennedy Dec 22 '08 at 15:05
  • indeed - it's a coincidence that the second byte of the UTF-8 encoding of 0xA3 also happens to be 0xA3 ! – Alnitak Dec 22 '08 at 15:28
  • Not really a coincidence if you look at how the encoding works. Any 8 bit character in decimal range 128-191 has the first two binary digits as 10, which is the same as the UTF-8 lead bits, thus char n in that range is encoded as 0xC2 n – Paul Dixon Dec 22 '08 at 16:04
  • Got to be honest and say the encoding stuff has gone a little over my head, but combined with http://www.joelonsoftware.com/articles/Unicode.html and http://forums.devnetwork.net/viewtopic.php?f=1&t=91448&start=0&st=0&sk=t&sd=a - issue sorted. Question updated above for other eedgits like me! – suitedupgeek Dec 22 '08 at 16:32
  • ok, it's not a *real* coincidence, yes anything from 128 - 191 would do that. The coincidence is that the OP happened to see one of that relatively small set of characters where this is true. – Alnitak Dec 22 '08 at 16:34
8

£ is 0xC2 0xA3 which is the UTF-8 encoding for £ symbol - so you're storing it as UTF-8, but presumably viewing it as Latin-1 or something other than UTF-8

It's useful to know how to spot and decode UTF-8 by hand - check the wikipedia page for info on how the encoding works:

  • 0xC2A3 = 110 00010 10 100011
  • The bold parts are the actual "payload", which gives 10100011, which is 0xA3, the pound symbol.
Paul Dixon
  • 295,876
  • 54
  • 310
  • 348
  • 2
    Thanks, good answer which helped me understand the why - Alnitak got the Accepted because it helped me understand the how-to-fix! +1upvote – suitedupgeek Dec 22 '08 at 14:59
7

In PHP, another small scale solution is to do a string conversion on the returned utf8 string:

print iconv('UTF-8', 'ASCII//TRANSLIT', "Mystring â"); //"Mystring "

Or in other platforms fire a system call to the inconv command (linux / osx)

http://php.net/manual/en/function.iconv.php#83238

Ben
  • 1,203
  • 13
  • 8
2

You need to serve your HTML in utf-8 encoding (actually everyone needs to do this I think!) Header like:

Content-Type: text/html; charset=UTF-8

Or the equivalent. Double check the details though. Should always be declaring the charset as a browser can default to anything it likes.

1

To remove a  use:

$column = str_replace("\xc2\xa0", '', $column);

Credits among others: How to remove all occurrences of c2a0 in a string with PHP?

Community
  • 1
  • 1
user109764
  • 576
  • 6
  • 11
0

Thanks a lot. I had been suspecting mysql for being currupting the pound symbol. Now all i need to do is wherever the csv record is generated, just use wrap them incov funciton. Though this is a good job, I am happy, at least someone showed exactly what to do. I sincerly appreciate dislaying the previous and the new 'header' values. It was a great help to me.

-mark

0

If you save line "The £50,000 Development Challenge" in two different data type column i.e. "varchar" & "text" field.

Before i save i have replaced the symbol with html equi value using following function. str_replace("£", "£", $title);

You will find that value stored in text fields is &pound where as in varchar its "£".