0

I have read a post here that utf8_bin gives us more accuracy on comparing characters while utf8_general_ci does not.

I wonder - if I have a table that stores usernames and passwords, and I need them to be exact or correct when the user logs in my website.

Then should I use utf8_bin for this purpose?

Thanks.

EDIT:

By the way, this is the hash function I use to has the password,

function hash_sha512($phrase,&$salt = null)
{
    //$pepper = '!@#$%^&*()_+=-{}][;";/?<>.,';

    if ($salt == '')
    {
        $salt = substr(hash('sha512',uniqid(rand(), true).PEPPER_KEY.microtime()), 0, SALT_LENGTH);
    }
    else
    {
        $salt = substr($salt, 0, SALT_LENGTH);
    }

    return hash('sha512',$salt.PEPPER_KEY.$phrase);
}
Run
  • 54,938
  • 169
  • 450
  • 748
  • 2
    you should store your passwords as salted hash. bonus: with hashes you don't have to worry about unicode (only [a-f0-9]) – knittl Jan 24 '11 at 16:51
  • @knittl: yes i have hashed all my passwords. thanks. – Run Jan 24 '11 at 16:53
  • @lauthiamkok: What format does the hashed value has? – Gumbo Jan 24 '11 at 17:15
  • @knittl: In a world where renting a cheap AWS instance to build fast rainbow tables, salted hashes of `sha1()` won't do you much good. `sha1()` is simply too fast of an algorithm. – Andrew Moore Jan 24 '11 at 17:18
  • @andrew: still better than not salting or not hashing at all … – knittl Jan 24 '11 at 17:27
  • @Gumbo♦: the hashed value I have normally like this - 1ac5694de74e0b2b9c16df2039d8b316a62fd36cdc3b58a2f7df2a1c63f91cdf3f926f6203fb1125f877cb3588eea23588957cfa96d161fc82c5eb2af842d28d - is it safe? – Run Jan 24 '11 at 17:36
  • @lauthiamkok: So it’s a hexadecimal representation. – Gumbo Jan 24 '11 at 17:46
  • @Gumbo♦: any problems with hexadecimal representation? – Run Jan 24 '11 at 17:51
  • @lauthiamkok: No, not in general. You just have to take care with the case of the letters as it depends on the collation whether `a`=`A` or not. If it’s the latter case you should stick to either uppercase or lowercase. You could also use a different base than hexadecimal (base 16), for example a binary representation (base 256). – Gumbo Jan 24 '11 at 18:06
  • @Gumbo♦: thanks. what can I do if I want to use binary representation (base 256), any existing posts here I can go to? – Run Jan 24 '11 at 18:12
  • @lauthiamkok: [PHP’s `sha1`](http://php.net/sha1) does that by setting the second parameter to true. But then you need an appropriate data binary-safe type. See [Storing SHA1 hash values in MySQL](http://stackoverflow.com/questions/614476/storing-sha1-hash-values-in-mysql). – Gumbo Jan 24 '11 at 18:20
  • @Gumbo♦: thanks. I use sha512 as in my edit above. I thought I read it somewhere before that sha512 is now safer than sha1? – Run Jan 24 '11 at 18:26
  • @lauthiamkok: `sha512()` has the same problem as `sha1()`. It's too fast for use in a security context. DON'T TRY TO SOLVE CRYPTOLOGY BY YOURSELF. Trust experts in the field and use the [PHP Password Hashing Framework](http://www.openwall.com/phpass/)... It uses bcrypt and is secure TODAY... and will be secure TOMORROW (with a higher rounds count). – Andrew Moore Jan 24 '11 at 18:36
  • @Andrew Moore: thanks. I just got that download and going to try to integrate with my website :-) – Run Jan 24 '11 at 18:39

2 Answers2

2

You're talking about the collation -- it's the characters the MySQL table will support. The "_ci" on a collation indicates that the collation is Case Insensitive. Meaning, "a" == "A" while in a case sensitive collation the example would evaluate to being false.

So yes, choosing a collation that is case sensitive will provide better accuracy. You can store the values using a case insensitive collation, but set a particular one for the query evaluation using the COLLATE function.

OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
1

First on the issue of password storage... Since you seem to be using PHP (from your question history)... Salted sha1() hashes just won't cut it in a world where renting a few AWS instances to compute fast rainbow tables... sha1() is too fast.

Instead of trying your hand at do-it-yourself cryptology, why not trust libraries made by actual experts in the field? Use the Portable PHP password hashing framework.

PHPass actually uses bcrypt, which is an algorithm designed to prevent rainbow table, dictionary and brute force attacks. You can initialize it with a number of rounds: the higher the rounds, the longer it takes to compute the hash. That way, you can create stronger hashes if processing power increases.

Using it is simple:

require('PasswordHash.php');

$phpass = new PasswordHash(12, false); // Initiate for 12 rounds, using bcrypt

// Hash a password
$hash = $phpass->HashPassword('my secret password');

// Compare an hash to a given password
$formSupplied = 'hello world';
$isRight = $phpass->CheckPassword($formSupplied, $hash);

if($isRight) echo "Good";
else echo "Wrong";

Now on the subject of usernames... Store them using a _bin collation (ie.: utf8_bin). This will force MySQL to binary compare when during a WHERE and effectively makes your usernames case-sensitive.

HOWEVER, since this is UTF-8, it is going to be important to normalize the username before inserting and querying your data. Different operating systems represent accented characters in different ways. PHP has the intl extension which has a facility for UTF-8 normalization. The following should do:

$_POST['username'] = Normalizer::normalize($_POST['username']);
Andrew Moore
  • 93,497
  • 30
  • 163
  • 175
  • doesn't utf8 state that there should only be one valid encoding for any character? – knittl Jan 24 '11 at 17:26
  • @knittl: é é... Same character, two different ways to write it... Some characters can be expressed either by its normalized form or by its base form plus combining diacritical marks. See [UAX #15: Unicode Normalization Forms](http://www.unicode.org/reports/tr15/). – Andrew Moore Jan 24 '11 at 17:34
  • @Andrew Moore :thanks for this reply. can I ask - what does it mean to 'normalize' here? why do we need to normalise the username? – Run Jan 24 '11 at 17:40
  • @Andrew Moore: yes I'm using PHP... :-) – Run Jan 24 '11 at 17:44
  • @andrew: the two characters look different on my screen, but i understand your point. using single chars vs. combining chars with their accents. thanks for clarifying – knittl Jan 24 '11 at 17:46
  • @leuthiamkok: [UAX #15: Unicode Normalization Forms](http://www.unicode.org/reports/tr15/) explains exactly what UTF-8 normalization is. For any characters that isn't part of the normal US alphabet, there are more then one way to represent one character. Depending of the OS, you'll get the same text represented in different ways. UTF-8 normalizations that the same text is always represented in the same way. – Andrew Moore Jan 24 '11 at 17:47
  • @Andrew Moore: thank you. my server is running php 5.1xx so I think I cannot use Normalizer::normalize... – Run Jan 24 '11 at 17:54
  • @lauthiamkok: You still can.. Just compile the intl extension. – Andrew Moore Jan 24 '11 at 18:37
  • @Andrew Moore: sorry again - 'compile'? I don't own the server, how do I compile the intl extension? thanks. – Run Jan 24 '11 at 18:40
  • @lauthiamkok: Then you are in a sticky situation. If you can't normalize UTF-8 strings that you need to compare after, then limit those to normal alpha-numeric sequences... I would personally push for an upgrade to PHP 5.3... Ask your provider. – Andrew Moore Jan 24 '11 at 18:46
  • @Andrew Moore: the server provider is annoying. looking for a modern and up-to-date server now! thanks for your advice :-) – Run Jan 24 '11 at 18:53