In my Ruby app I need to handle URIs from user input (which are actually IRIs):

str = "http://उदाहरण.परीक्षा/मुख्य_पृष्ठ"

I normalize these using Addressable, and only store the normalized form:

normalized = Addressable::URI.parse(str).normalize
normalized.to_s
#=> http://xn--p1b6ci4b4b3a.xn--11b5bs3a9aj6g/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0

This is nice to work with, but obviously not nice to display to end users.

For that I'd like to convert this URI back to its original form (no punycode host, no percent-encoded path).

Addressable has display_uri, but that only converts the host:

nicer = normalized.display_uri.to_s
#=> http://उदाहरण.परीक्षा/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0

This looks like it works:

display_s = Addressable::URI.parse(str).display_uri.to_s
pretty = Addressable::URI.unencode(display_s.force_encoding("ASCII-8BIT"))

However, that code looks wrong (I should not need to use force_encoding) and I'm not at all confident that it is correct.

  • What is a good, sane way to convert the entire URI into something readable for end users ("http://उदाहरण.परीक्षा/मुख्य_पृष्ठ")?

  • Is storing the URIs in normalized form even a good idea, or does it have consequences I might not be aware of?

code: https://gist.github.com/levinalex/6115764

tl;dr

How do I convert this:

"http://xn--p1b6ci4b4b3a.xn--11b5bs3a9aj6g/" +
"%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4" +
"%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0"

to this:

"http://उदाहरण.परीक्षा/मुख्य_पृष्ठ"
Charles
levinalex
  • First, +1 for using Addressable. I don't see a lot of advantage storing the normalized URI if your DB will accept it normally. If normalized, it's longer, it's not obvious what it is, and it'd be harder to search. – the Tin Man Jul 30 '13 at 19:58
  • Did the normalization for consistency and duplicate detection. Seemed simpler at first. Now I'm not so sure. – levinalex Jul 30 '13 at 20:57
  • Duplicate detection can be handled by the database automatically, using a `unique` setting on the index for that field. Your code shouldn't try to track that, instead it should react if the DBM rejects the insert/update due to a "duplicate key found" type error. You might want to normalize to resolve characters that have multiple code-points that point to the same character though as that discrepancy would fool the index. Or, maybe not if that ends up changing the URL into something that would resolve differently; That's the advantage of puny-codes. It's a tough call. – the Tin Man Jul 30 '13 at 22:24
  • Well, if everything is normalized, then and only **then** will database uniqueness do any good whatsoever. So normalizing before storing the field in the database is 100% the right call if duplicate detection is a desired behavior. – Bob Aman Aug 01 '13 at 16:58

1 Answer

You should not need any forced (re-)encoding to recover the original URI. Simply:

normalised_s = "http://xn--p1b6ci4b4b3a.xn--11b5bs3a9aj6g/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0"        
Addressable::URI.unencode(Addressable::URI.parse(normalised_s).display_uri)

=> "http://उदाहरण.परीक्षा/मुख्य_पृष्ठ"
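One caveat worth keeping in mind: unencoding the whole URI string also decodes escapes of reserved ASCII characters such as %2F or %3F, so the pretty string should be treated as display-only and not parsed back into a URI. If that matters for your data, a more conservative decoder can be sketched with the standard library alone. The helper name `display_decode` below is hypothetical and not part of Addressable; it only handles percent-escapes (the punycode host would still go through display_uri), and it assumes the escaped bytes are meant to be UTF-8.

```ruby
# Hypothetical display-only helper (not part of Addressable).
# Decodes runs of percent-escaped non-ASCII bytes (%80..%FF) when they form
# valid UTF-8, and leaves ASCII escapes such as %2F or %3F untouched.
def display_decode(str)
  str.gsub(/(?:%[89A-Fa-f]\h)+/) do |run|
    # "%E0%A4%AE" -> "E0A4AE" -> raw bytes, then tag those bytes as UTF-8.
    decoded = [run.delete("%")].pack("H*").force_encoding(Encoding::UTF_8)
    # Fall back to the original escapes if the bytes are not valid UTF-8.
    decoded.valid_encoding? ? decoded : run
  end
end

path = "/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0"
puts display_decode(path)      # the readable Devanagari path
puts display_decode("/a%2Fb")  # the escaped slash is left as %2F
```

The force_encoding call here is expected rather than a smell: percent-decoding yields raw bytes, and something has to declare that those bytes are UTF-8.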

To repeat what Bob said in the comments, normalisation is definitely a good way of guaranteeing uniqueness for storage.

i-blis