
I’m creating a MySQL database that stores Chinese characters with their associated pīnyīn pronunciations. I’ve set everything up to use the UTF-8 charset, so I have no trouble with most of the symbols I’m using. The exception, strangely, is certain Latin characters with tone marks, and only when I write them into the database from $_POST using PHP.

Those are: all characters with an acute accent (á, é, í, ó, ú), except ǘ (?!); and all characters with a grave accent (à, è, ì, ò, ù), again except ǜ. When they are typed into a form and that form is submitted to the database, those characters are simply dropped, as if they never existed. E.g., cháng is stored as chng. All other characters (with a caron, like ǎ, or a macron, like ā) are written fine, and so are actual Chinese characters.

Again, I’m using UTF-8 everywhere possible, and so far this problem has only occurred when submitting data from a form. Earlier, I ran a script that inserted an array containing those characters directly into the database, and everything went fine.

Any ideas?

Arnold
  • I guess it's best to show us the code of this form. Perhaps some function used there is not multi-byte safe, or some other problem exists. – ypercubeᵀᴹ May 28 '11 at 12:24
  • My function to get the $_POSTed data and INSERT it into the database is here: http://pastebin.com/DXw7czsp – Arnold May 28 '11 at 12:36
  • (a) Are you using `htmlentities` anywhere? There are entities defined for the Latin-1 simple accented letters, but not the double-accents, so that could make a difference (use `htmlspecialchars` instead); – bobince May 28 '11 at 12:39
  • (b) Make sure the browser is displaying the page in View -> Encoding -> UTF-8, so that all characters are submitted as plain byte sequences. If the browser thinks the page is in a Western European encoding, it will use single bytes for the simple accents, which fit in that encoding, and broken HTML character reference escapes for the double accents, which again could make a difference if you're not handling it right. – bobince May 28 '11 at 12:39
  • Not enough information in that code paste to guess what's happening, but the `filterInput()` and `validateInput()` sound suspicious. If they are doing HTML- or SQL-escaping then (a) that might mangle characters and (b) that's totally the wrong approach; if they're not, then you've got SQL-injection holes in your query. – bobince May 28 '11 at 12:44
  • Yes! So stupid, how could I possibly have missed that. Indeed, one of those functions was doing escaping, which was the reason for the problem. I removed that, and everything works now. I’m going to revisit the validation methods that I’ve been using so far. Thank you very much for pointing that out to me. – Arnold May 28 '11 at 23:38
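
Why exactly the acute and grave letters were hit, but not ǘ, ǜ, ǎ or ā, fits bobince's point (a): the plain accented letters belong to Latin-1 and have named HTML entities, while the caron, macron and double-accent letters do not. The check below is only an illustration, written in Python rather than PHP; it assumes Python's `html.entities.codepoint2name` table (the HTML 4 named entities) as a stand-in for the set of names an entity-escaping function would use, and `has_html4_entity` is a hypothetical helper name.

```python
from html.entities import codepoint2name

# Letters that were dropped in the question: á é í ó ú à è ì ò ù
AFFECTED = "\u00e1\u00e9\u00ed\u00f3\u00fa\u00e0\u00e8\u00ec\u00f2\u00f9"
# Letters that were stored correctly: ǘ ǜ ǎ ā
UNAFFECTED = "\u01d8\u01dc\u01ce\u0101"


def has_html4_entity(ch):
    """Return True if the character has a named HTML 4 entity."""
    return ord(ch) in codepoint2name


for ch in AFFECTED + UNAFFECTED:
    # Show the entity name, or None when the character has no named entity.
    print(ch, codepoint2name.get(ord(ch)))
```

Every letter in the first group has an entity name (for example `aacute` for á) and none in the second group does, so an escaping step that rewrites characters as entities would treat the two groups differently.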

2 Answers


I think you could submit the pinyin in a numbered format,
e.g. cháng as cha2ng,
and then convert it in the PHP script with a mapping.

Here is one method for doing that:
Convert numbered to accentuated Pinyin?
I hope it helps.
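
A minimal sketch of such a mapping, in Python rather than PHP and not taken from the linked question: it assumes the tone digit directly follows the vowel that carries the mark, as in cha2ng above, and the name `numbered_to_accented` is hypothetical. It appends the matching combining mark to the vowel and lets Unicode NFC normalization compose the accented letter.

```python
import re
import unicodedata

# Tone digit -> combining mark: macron, acute, caron, grave.
COMBINING = {"1": "\u0304", "2": "\u0301", "3": "\u030c", "4": "\u0300"}


def numbered_to_accented(text):
    """Replace vowel+digit pairs (e.g. 'a2') with the tone-marked vowel."""
    def repl(match):
        vowel, tone = match.group(1), match.group(2)
        # NFC composes the vowel and the combining mark into one letter.
        return unicodedata.normalize("NFC", vowel + COMBINING[tone])

    # \u00fc is ü, which also takes tone marks in pinyin.
    return re.sub("([aeiou\u00fc])([1-4])", repl, text)


print(numbered_to_accented("cha2ng"))  # cháng
```

The more common numbered convention puts the digit at the end of the syllable (chang2); handling that would additionally need the rule for choosing which vowel takes the mark, which this sketch does not attempt.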

YeJiabin

I found a solution for comparing syllables that differ only in their tone mark!

Before:

SELECT 'liàng' = 'liǎng';

Change to:

SELECT CONVERT('liàng' USING BINARY)= CONVERT('liǎng' USING BINARY) as equal;
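
The difference between the two queries can be pictured with a rough analogy in Python. This is not how MySQL implements collations; it assumes the first comparison runs under an accent-insensitive collation (such as `utf8_general_ci`), which the sketch imitates by stripping combining marks, while the binary comparison looks at the raw bytes. `strip_marks` is a hypothetical helper name.

```python
import unicodedata


def strip_marks(text):
    """Drop combining marks after decomposing, imitating accent-insensitivity."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


a = "li\u00e0ng"  # liàng
b = "li\u01ceng"  # liǎng

# Accent-insensitive view: both reduce to "liang", so they compare equal.
print(strip_marks(a) == strip_marks(b))  # True
# Byte-wise view: the UTF-8 encodings differ, so they compare unequal.
print(a.encode("utf-8") == b.encode("utf-8"))  # False
```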
HarrisonQi