
I'm trying to take an IDN URL along the lines of http://exämple.se/path or https://äxämple.se/anotherpath?foo=bar&baf=bas so that I get the components of it like so:

[0] http(s)://
[1] äxämple.se
[2] /anotherpath?foo=bar&baf=bas

My first thought was "I'll just use parse_url!". Well, except it doesn't do IDN domains so no luck.

Next I tried a bunch of my own regex tricks but failed to get any useful output (some of them worked to a degree but were still painfully lacking).

Finally I tried various other people's regex patterns, but none of them worked right for me, where "worked right" means capturing anything useful. One captured the whole URL as its "protocol" part; most of the others captured nothing or were clearly functionally identical to ones I'd already tried.

And of course, why am I doing this? I want to run idn_to_ascii on the domain name before piecing the URL back together and storing it in a db.

So, what am I doing wrong here? Is my approach completely wrong or is there some magic invocation of preg_match which will fix my problem?

Edit: Preferably I'd like a solution which doesn't involve downloading a blob of code someone else wrote (like say, a custom class named something like ParseIDNUrl weighing in at 100kB)

mludd

2 Answers


parse_url should work fine. Using PHP 5.3.4 I've been able to extract just the domain part:

print parse_url('http://äxämple.se/foobar', PHP_URL_HOST);

If your page or terminal is not UTF-8, you may need to convert the encoding (utf8_decode converts UTF-8 to ISO-8859-1):

print utf8_decode(parse_url('http://äxämple.se/foobar', PHP_URL_HOST));

The output I get is:

äxämple.se

Hope that helps!
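
To cover the goal stated in the question (running idn_to_ascii on the host before storing the URL), here is a minimal sketch. It assumes PHP 7.0+ with the intl extension loaded and valid UTF-8 input. The helper name to_ascii_url is made up for illustration, and user:password parts are deliberately not handled.

```php
<?php
// Hypothetical helper: convert only the host of a URL to its ASCII (Punycode) form.
// Assumes the intl extension is available and $url is valid UTF-8.
function to_ascii_url(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Cannot parse URL: $url");
    }

    // idn_to_ascii() returns false when the host cannot be converted.
    $host = idn_to_ascii($parts['host'], IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    if ($host === false) {
        throw new InvalidArgumentException("Cannot convert host: {$parts['host']}");
    }

    // Piece the URL back together; user/pass components are ignored in this sketch.
    $result = $parts['scheme'] . '://' . $host;
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    $result .= $parts['path'] ?? '';
    if (isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }
    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }
    return $result;
}

echo to_ascii_url('https://äxämple.se/anotherpath?foo=bar&baf=bas'), "\n";
```

The host in the result should start with xn--, while the path and query string are passed through untouched.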

alganet
  • I think this may be some encoding issue on my part. It appears that if I do `print_r(parse_url(''));` it runs fine but if I use user input it doesn't handle it quite so well. And here I thought I had nice UTF-8 input all the way. Guess it's time to see what code touches my user input before it reaches `parse_url`… – mludd May 31 '12 at 16:53

Sorry, I didn't read your post fully.

Here's a regex I found in another question, "Properly Matching a IDN URL":

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
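
If all you need is the three pieces listed in the question, a much narrower pattern may be enough. This is only a sketch, assuming PHP 7.1+ and an http or https URL; split_url is a made-up helper name, and the host part will also contain any port or user:password prefix.

```php
<?php
// Hypothetical helper: split a URL into [scheme, host, rest], as in the question.
// Returns null when the input does not match or is not valid UTF-8.
function split_url(string $url): ?array
{
    // 1: scheme including "://"
    // 2: everything up to the first "/", "?" or "#"
    // 3: the remainder (may be empty)
    // The "u" modifier makes PCRE treat the subject as UTF-8.
    if (!preg_match('~^(https?://)([^/?#]+)(.*)$~su', $url, $m)) {
        return null;
    }
    return [$m[1], $m[2], $m[3]];
}

print_r(split_url('https://äxämple.se/anotherpath?foo=bar&baf=bas'));
```

The second element can then be passed to idn_to_ascii before the parts are joined back together.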
David Bélanger