I have URLs like these:

domain.com
domain.com/
www.domain.com
http://www.domain.com
http://domain.com
domain.com/catalog/nextcatalog/
domain.com/catalog/nextcatalog/page.html
domain.com/page.html
domain.com/page.html?arg=123&arg2=abc

I want to split each one into parts like this:

[0] = domain.com
[1] = catalog/nextcatalog/
[2] = page.html
[3] = arg=123&arg2=abc

I don't know how to get the data from a link like this:

domain.com

In that case I get http: in [0].

Is it possible to create a universal regex that can skip, for example, the catalog or the page when it is not present in the link?

I tried a pattern like this: ^(?:http:\/\/)?(?:www\.)?(.*?)(?=\/)(.*)(?=\/)(.*)$ but it doesn't work in all cases.

2 Answers

Use the Uri class to parse URLs; it is designed to follow the relevant RFCs. It gives you access to the scheme, host, port, path, query string, etc. of the URL it parses.

LB2

I would recommend using the existing Uri class, which provides easy access to the parts of a URI. Some of the URLs in your sample list don't have a scheme, so you need to add it manually:

Uri uri = new Uri(url.StartsWith("http") ? url : "http://" + url);

Now you can use Uri.Host to get the host of the URI. For your sample input the hosts will be

"domain.com"
"domain.com"
"www.domain.com"
"www.domain.com"
"domain.com"
"domain.com"
"domain.com"
"domain.com"
"domain.com"

You can do a simple string replace to get rid of the www part:

uri.Host.Replace("www.", "")

Next come the query parameters. You can get them from Uri.Query. In your sample input only one URL has query parameters. The returned value will be

?arg=123&arg2=abc

Again, it's easy to get rid of the leading ?:

uri.Query.TrimStart('?') // arg=123&arg2=abc

Uri also has a Segments property, which contains an array of the path segments. You can check whether the last segment contains a . to get the next result:

uri.Segments.Last().Contains('.') ? uri.Segments.Last() : ""

If this is true, the last segment is the page (page.html in your samples). Output:

""
""
""
""
""
""
"page.html"
"page.html"
"page.html"  

You can also use a simple String.Join to concatenate the other segments into a string. Or you can do a string replace on Uri.LocalPath:

uri.Segments.Last().Contains('.') ?
   uri.LocalPath.Replace(uri.Segments.Last(), "") : uri.LocalPath;

Output:

"/"
"/"
"/"
"/"
"/"
"/catalog/nextcatalog/"
"/catalog/nextcatalog/"
"/"
"/"

All you need to do is call TrimStart('/') to get rid of the leading slash.
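Putting the steps above together, here is a minimal sketch that produces the four parts from the question. UrlParts and Split are hypothetical names chosen for illustration, and the sketch assumes that a last segment containing a dot is the page. It strips only a leading "www." rather than replacing it anywhere in the host.

```csharp
using System;
using System.Linq;

static class UrlParts
{
    // Hypothetical helper: returns { host, directory, page, query }.
    public static string[] Split(string url)
    {
        // Uri needs an absolute URL, so prepend a scheme when it is missing.
        bool hasScheme =
            url.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
            url.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
        var uri = new Uri(hasScheme ? url : "http://" + url);

        // Drop only a leading "www." from the host.
        string host = uri.Host.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
            ? uri.Host.Substring(4)
            : uri.Host;

        // Assumption: a last segment that contains a dot is the page.
        string last = uri.Segments.Last();
        string page = last.Contains(".") ? last : "";

        // Join every segment except the page, then drop the leading slash.
        var dirSegments = page.Length > 0
            ? uri.Segments.Take(uri.Segments.Length - 1)
            : uri.Segments;
        string directory = string.Concat(dirSegments).TrimStart('/');

        string query = uri.Query.TrimStart('?');

        return new[] { host, directory, page, query };
    }
}

static class Program
{
    static void Main()
    {
        var samples = new[]
        {
            "domain.com",
            "http://www.domain.com",
            "domain.com/catalog/nextcatalog/page.html",
            "domain.com/page.html?arg=123&arg2=abc"
        };

        foreach (var url in samples)
            Console.WriteLine(string.Join(" | ", UrlParts.Split(url)));
    }
}
```

Parts that are missing from a URL come back as empty strings, so the array always has the same four positions regardless of the input.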

Sergey Berezovskiy
  • +1. Don't forget to use `HttpUtility.ParseQueryString` to [parse the query string](http://stackoverflow.com/questions/68624/how-to-parse-a-query-string-into-a-namevaluecollection-in-net) – Alexei Levenkov May 09 '14 at 23:30
  • @AlexeiLevenkov I completely agree with you. That will be the next step of parsing the URL :) – Sergey Berezovskiy May 09 '14 at 23:36
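
For the next step mentioned in the comments, a short sketch of HttpUtility.ParseQueryString applied to one of the sample URLs might look like this. It assumes the project can reference System.Web (on .NET Framework that means adding the assembly reference); the Program class is just scaffolding for the example.

```csharp
using System;
using System.Collections.Specialized;
using System.Web;

static class Program
{
    static void Main()
    {
        var uri = new Uri("http://domain.com/page.html?arg=123&arg2=abc");

        // ParseQueryString accepts the query with or without the leading '?'.
        NameValueCollection args = HttpUtility.ParseQueryString(uri.Query);

        Console.WriteLine(args["arg"]);   // 123
        Console.WriteLine(args["arg2"]);  // abc
    }
}
```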