1

I am trying to separate invalid URLs from valid ones using .NET.

I am using the Uri.TryCreate() method for this.

The overload I am calling has the following signature:

public static bool TryCreate(string uriString, UriKind uriKind, out Uri result)

This is what I am doing:

Uri uri = null;

var domainList = new List<string>();
domainList.Add("asas");
domainList.Add("www.stackoverflow.com");
domainList.Add("www.codera.org");
domainList.Add("www.joker.testtest");
domainList.Add("about.me");
domainList.Add("www.ma.tt");

var correctList = new List<string>();

foreach (var item in domainList)
{
    if(Uri.TryCreate(item, UriKind.RelativeOrAbsolute, out uri))
    {    
        correctList.Add(item);
    }
}

With the above code I expect asas and www.joker.testtest to be left out of the list, but they are not.

Can someone help me out with this?

Update: I just tried Uri.IsWellFormedUriString; this didn't help either.

Further update

List of valid URIs

List of invalid URIs

  • asas
  • as#@SAd
  • this.not.valid
  • www.asa.toptoptop
Yasser Shaikh
  • http://joshua-smith.net/articles/view-article/3/Check-if-a-URL-is-valid-with-C – Chuck Norris Sep 12 '12 at 11:00
  • take a look at this: http://stackoverflow.com/questions/924679/c-sharp-how-can-i-check-if-a-url-exists-is-valid – Alaa Jabre Sep 12 '12 at 11:00
  • What is your definition of a valid URI? All of your examples _are_ valid URIs, though not all are real _domains_. – Oded Sep 12 '12 at 11:01
  • Maybe you could check via `Uri.IsWellFormedUriString`, (oooops missed the update) – V4Vendetta Sep 12 '12 at 11:02
  • My guess is that he means a URI shouldn't end with testtest, so you would need to check all possible URI endings – Alaa Jabre Sep 12 '12 at 11:03
  • @Oded `asas` and `www.joker.testtest` are the entries in the list that are not valid; I want to remove them. Non-real domains are okay. – Yasser Shaikh Sep 12 '12 at 11:03
  • @V4Vendetta - The OP added an update that he has tried that. – Oded Sep 12 '12 at 11:03
  • What do you mean by "not valid"? They are, as far as the URI specification is concerned, valid. What you are looking for is something else - those are all syntactically valid URIs. – Oded Sep 12 '12 at 11:04
  • Are you able to perform pings from this tool or is this offline only? – Ryan McDonough Sep 12 '12 at 11:07
  • 2
    You need to define "invalid". If "invalid" means that the URI does not exist, then my solution will work. If "invalid" means "not well formed", then O.D.'s solution will suffice. Otherwise, you need to define "invalid"! – MBZ Sep 12 '12 at 11:16
  • You still need to be more precise. From your point of view, what's the difference between `www.ma.tt` and `www.asa.toptoptop`? Is `www.asa.topt` "invalid"? – MBZ Sep 12 '12 at 11:29
  • @MBZ .tt is a `valid` top level domain, while `.toptoptop` is not – Yasser Shaikh Sep 12 '12 at 11:31
  • 1
    `MUSEUM` is a valid top-level domain as well, so is something like `www.asa.MUSEUM` valid? Check the TLDs here: http://data.iana.org/TLD/tlds-alpha-by-domain.txt – MBZ Sep 12 '12 at 11:32

6 Answers

3

You seem to be confused about what exactly a URL (or URI, the difference is not significant here) is. For example, http://stackoverflow.com is a valid absolute URL. On the other hand, stackoverflow.com is technically a valid relative URL, but it would refer to a file named stackoverflow.com in the current directory, not the website with that name. Separately from that, stackoverflow.com is also a registered domain name.

If you want to check whether a domain name is valid, you need to define what exactly you mean by “valid”:

  1. Is it a valid domain name? Check whether the string consists of parts separated by dots, each part can contain letters, numbers and a hyphen (-). For example, asas and this.not.valid are both valid domain names.
  2. Could it be an Internet domain name? Domain names on the Internet (as opposed to intranet) are specific in that they always have a TLD (top-level domain). So, asas certainly isn't an Internet domain name, but this.not.valid could be.
  3. Is it a domain name under existing TLD? You can download the list of all TLDs and check against that. For example, this.not.valid wouldn't be considered valid under this rule, but thisisnotvalid.com would.
  4. Is it a registered domain name?
  5. Does the domain name resolve to an IP address? A domain name could be registered, but it still may not have an IP address in its DNS record.
  6. Does the computer the domain name points to respond to requests? The requests that make the most sense are a simple HTTP request (e.g. trying to access http://domaininquestion/) or ping.
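
Checks 1 to 3 above need no network access, so they can be sketched in a few lines. This is only a minimal sketch under stated assumptions: the class name DomainChecks, its method names, and the deliberately tiny KnownTlds set are hypothetical, and a real check would load the full IANA TLD list instead. The syntax check relies on the framework method Uri.CheckHostName.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DomainChecks
{
    // Hypothetical, deliberately tiny TLD set for illustration only;
    // a real check would load the complete list published by IANA.
    static readonly HashSet<string> KnownTlds =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "com", "org", "me", "tt" };

    // Check 1: the string is syntactically a DNS host name.
    public static bool IsDomainSyntax(string name)
    {
        return Uri.CheckHostName(name) == UriHostNameType.Dns;
    }

    // Check 2: the name has at least two dot-separated labels, so it has a TLD part.
    public static bool HasTld(string name)
    {
        return IsDomainSyntax(name) && name.TrimEnd('.').Contains(".");
    }

    // Check 3: the last label is one of the known TLDs.
    public static bool HasKnownTld(string name)
    {
        if (!HasTld(name)) return false;
        string tld = name.TrimEnd('.').Split('.').Last();
        return KnownTlds.Contains(tld);
    }
}
```

Applied to the list from the question, for example with domainList.Where(DomainChecks.HasKnownTld), this should drop asas (no TLD) and www.joker.testtest (unknown TLD) while keeping the other four entries.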
svick
1

Try this one:

public static bool IsWellFormedUriString( string uriString, UriKind uriKind )

Alternatively, you can do this with a regular expression such as:

^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$

Take a look at this list
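
To see what the pattern above accepts and rejects, it can be wrapped in a small helper. This is only an illustrative sketch: the class name HttpUrlPattern and the method name Matches are made up for the example, and the pattern itself is copied unchanged from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class HttpUrlPattern
{
    // The pattern from this answer, unchanged: it requires an http:// prefix
    // and a final label of two or three letters.
    static readonly Regex Pattern =
        new Regex(@"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");

    public static bool Matches(string candidate)
    {
        return Pattern.IsMatch(candidate);
    }
}
```

Note the limits of this pattern: it rejects the entries from the question as written, because none of them starts with http://, it rejects https URLs, and it rejects real TLDs longer than three letters such as museum.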

CloudyMarble
1

The problem is that none of the URLs you have added here qualify as absolute URLs. For that, you have to prefix each URL with its protocol.

You can test and find out that

www.stackoverflow.com - Relative URL
http://www.stackoverflow.com - Absolute URL
//www.stackoverflow.com - Absolute URL (No surprise here; see RFC 3986: "Uniform Resource Identifier (URI): Generic Syntax", Section 4.2)

The point is that you have to prefix at least // to show that it is an absolute URL.

So, in a nutshell, since all your URLs are relative URLs, they pass all your tests.
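
The difference is easy to observe by asking for UriKind.Absolute instead of UriKind.RelativeOrAbsolute. The helper below is a minimal sketch, not a complete fix: the class name AbsoluteUrlCheck and the method name IsAbsoluteHttpUrl are hypothetical, and restricting the scheme to http or https is an assumption about what the question wants.

```csharp
using System;

static class AbsoluteUrlCheck
{
    // True only when the string parses as an absolute URI
    // whose scheme is http or https.
    public static bool IsAbsoluteHttpUrl(string candidate)
    {
        Uri uri;
        return Uri.TryCreate(candidate, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

With this check, www.stackoverflow.com and asas are both rejected because they carry no scheme, while http://www.stackoverflow.com is accepted. It still accepts http://asas and http://www.joker.testtest, since both are syntactically valid absolute URLs, so it does not distinguish real TLDs from made-up ones.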

naveen
  • 1
    This explains what is wrong with the code in the question, but it doesn't help fix it. – svick Sep 12 '12 at 12:19
0

All your examples are valid;
some are absolute URLs and some are relative, which is why none of them are removed.

Otherwise, for each URI, you might try constructing an HttpWebRequest and then checking for a correct response.

Vignesh.N
  • `asas` and `www.joker.testtest` are not valid from the list – Yasser Shaikh Sep 12 '12 at 11:07
  • Technically, this is correct. But saying that `www.stackoverflow.com` is a valid relative URL is not the right way to look at it, because that would represent file named `www.stackoverflow.com` in the current directory. – svick Sep 12 '12 at 12:18
0

After reading the other answers, I understand that you are not looking to check whether the domain exists or responds to a ping; you need to test the entries against your own grammar, that is, the syntax of a domain name. Is that right?

For that, you need to rely on a regex test: define a proper rule to evaluate each domain name, and exclude the entries that fail it from the list.

You can adopt these patterns, modify one to suit your needs, and then test it against every element in the list.
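
As one possible starting point, here is a sketch of such a rule. The pattern is an assumption written for this example, not one of the patterns referred to above, and the class name DomainSyntax is hypothetical. It requires at least two dot-separated labels of letters, digits, and inner hyphens, ending in a purely alphabetic last label.

```csharp
using System;
using System.Text.RegularExpressions;

static class DomainSyntax
{
    // Illustrative pattern (an assumption, adjust to your own grammar):
    // total length up to 253, labels up to 63 characters that do not
    // start or end with a hyphen, and an alphabetic last label.
    static readonly Regex Pattern = new Regex(
        @"^(?=.{1,253}$)(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}$");

    public static bool IsMatch(string candidate)
    {
        return Pattern.IsMatch(candidate);
    }
}
```

This rejects asas and as#@SAd, but it accepts www.asa.toptoptop and this.not.valid, because a purely syntactic rule cannot know which TLDs actually exist; that requires checking the last label against a TLD list.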

John Conde
Jigar Pandya
-2

All of your URIs are well-formed, so TryCreate and IsWellFormedUriString will not help in your case.

Following the approach from here, the solution is to try opening the URI:

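// MyClient is the WebClient subclass from the linked answer; it sends a HEAD request when HeadOnly is set.
using (var client = new MyClient())
{
    client.HeadOnly = true;
    // fine, no content downloaded (the scheme is required for a web request)
    string s1 = client.DownloadString("http://www.stackoverflow.com");
    // throws a WebException, because the host name cannot be resolved
    string s2 = client.DownloadString("http://www.joker.testtest");
}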
Community
MBZ