1

I'd like to also store strings in the database in a more queryable, slug-like format: forced to lowercase, with accented letters replaced by their Latin counterparts (ä -> a, ö -> o, ç -> c, etc.) and other special characters replaced with, e.g., dashes. Is there a standard for this kind of format? What would be the preferred way to achieve it in Java?

hleinone
  • I would look at this post: http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net – MikeKusold May 09 '11 at 13:46

2 Answers

0

The database can do this for you through collations. Collations specify which characters in a specific character set can be considered equivalent with each other when compared.

Have a look at this for a visual example of a collation:

http://www.collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html

Here's a good description of how collations work from the MySQL manual:

http://dev.mysql.com/doc/refman/5.0/en/charset-syntax.html
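To illustrate what "considered equivalent when compared" means, here is a minimal sketch using the JDK's own `java.text.Collator` rather than a database collation. The class and method names are hypothetical, and the JDK's English rules are only assumed to be comparable to an accent- and case-insensitive database collation; they are not the same tables MySQL uses.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of any library.
final class CollationDemo {

    // Returns true when the two strings differ at most by accents or letter case.
    static boolean equalIgnoringAccentsAndCase(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // PRIMARY strength compares base letters only, ignoring accents and case.
        collator.setStrength(Collator.PRIMARY);
        // Decompose accented characters so precomposed and combining forms compare alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "Caf\u00e9" is "Cafe" with an acute accent on the e.
        System.out.println(equalIgnoringAccentsAndCase("Caf\u00e9", "cafe")); // true
        System.out.println(equalIgnoringAccentsAndCase("cafe", "cake"));      // false
    }
}
```

Note that a collation only changes how strings are compared; the stored value is left untouched, so this does not by itself produce a slug.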

Eric
  • I'm looking for a database-agnostic solution, as my backend most probably won't support that. – hleinone May 09 '11 at 14:23
  • You might try this library: [link](http://site.icu-project.org/#TOC-Why-ICU4J-). It lets you work with character set collations in Java, but I'm not sure whether it meets your particular use case. – Eric May 09 '11 at 14:39
  • The Java [`Normalizer`](http://download.oracle.com/javase/6/docs/api/java/text/Normalizer.html) seems to group characters much like the MySQL collation charts you linked, but it still leaves some characters, such as ð, ø and æ, as is. I'd like to end up with just a-z and dashes. – hleinone May 09 '11 at 20:28
0

This is the solution that I've found to work best so far:

return Normalizer
    .normalize(src.trim().toLowerCase(Locale.ENGLISH),
        Normalizer.Form.NFD)
    .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
    .replaceAll("[^\\p{ASCII}]+", "-")
    .replaceAll("[^a-z0-9]+", "-").replaceAll("(^-|-$)+", "");

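This converts ¿Qué? to que, Cool!!!!1 to cool-1, and åæø to a. In the last case æ and ø have no canonical decomposition, so they are not reduced to base letters; they are replaced by a dash, which is then trimmed from the end.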
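For anyone who wants to try the chain above as a standalone unit, here is a minimal sketch that wraps it in a method. The class name `Slugifier` and method name `slugify` are hypothetical, and the patterns are precompiled only as a convenience. The separate non-ASCII replacement is folded into the `[^a-z0-9]+` step, since any non-ASCII character is also outside a-z and 0-9, so the result should be the same.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical wrapper around the replaceAll chain shown in the answer.
final class Slugifier {

    // Combining marks left behind after NFD decomposition (accents, umlauts, ...).
    private static final Pattern DIACRITICS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    // Any run of characters other than lowercase ASCII letters and digits.
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-z0-9]+");
    // A dash at the very start or very end of the string.
    private static final Pattern EDGE_DASHES = Pattern.compile("(^-|-$)+");

    static String slugify(String src) {
        // Lowercase first, then split accented letters into base letter + mark.
        String decomposed = Normalizer.normalize(
                src.trim().toLowerCase(Locale.ENGLISH), Normalizer.Form.NFD);
        String stripped = DIACRITICS.matcher(decomposed).replaceAll("");
        String dashed = NON_ALNUM.matcher(stripped).replaceAll("-");
        return EDGE_DASHES.matcher(dashed).replaceAll("");
    }

    public static void main(String[] args) {
        // Inputs are written with Unicode escapes to keep this file ASCII-only.
        System.out.println(slugify("\u00bfQu\u00e9?"));    // que
        System.out.println(slugify("Cool!!!!1"));          // cool-1
        System.out.println(slugify("\u00e5\u00e6\u00f8")); // a
    }
}
```

Characters without a decomposition (ð, ø, æ and similar) are still turned into dashes here; mapping them to letters would need an explicit replacement table, which this sketch does not attempt.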

hleinone