6

My web application stores URL segments in a database. These URL segments are based on user-submitted content.

What collation should I use for character strings that will appear in URLs?

My assumption is ASCII General CI (?) based on this question: Which characters make a URL invalid?

Community
  • 1
  • 1
Dolph
  • 49,714
  • 13
  • 63
  • 88

2 Answers2

3

It doesn't really matter as far as I can see. The characters valid in a URL are represented in any character set I know of, and I wouldn't use different collations between tables and columns - you'll get "illegal mix of collations" problems on any attempt to join them or perform any other kind of cross-column or cross-table operation (see my recent problem here).

Correct me if I'm wrong of course.

Community
  • 1
  • 1
Pekka
  • 442,112
  • 142
  • 972
  • 1,088
  • Then is there a performance factor for using, say, UTF-8 General CI instead of a "simpler" encoding that could store the same data (ASCII General CI)? – Dolph Feb 26 '10 at 19:33
  • I'm no database guru so I can't say for sure, but logic tells me no, because the characters you mention take up one byte in both standard ASCII and UTF-8 collations. I'm pretty sure any overhead *must* be minuscule. – Pekka Feb 26 '10 at 19:34
  • I believe case-insensitive encoding (**ascii_general_ci**) is slower then **ascii_bin**, because db engine needs to convert both values to the same case before performing actual compare. – vitsoft Feb 25 '14 at 21:58
  • 2
    @Pekka웃 The issue with using utf8mb4 over ascii for something like a phone number or URL only matters if storage space is an issue (doubt it) or the real issue, if you're using the columns in an index. If you have a URL field that's varchar(256) with ascii encoding, the key length is 256, but if you use utf8mb4, the key size has to be 1024 to support the possibility of that column containing a multibyte character, which is bigger than what InnoDB will allow for a key size – Brian Leishman Dec 22 '17 at 14:30
1

I would argue Case Sensitivity matters, since you don't want duplicate content from the URLs /home and /Home. These are 2 seperate pages, a mysql query in a _ci collation (select * from page where url='/Home') would return the page regardless of case.

Ben
  • 1,631
  • 13
  • 15