1

This character set

[^\/:] // all characters except / or :

is flagged as weak by JSLint because I should be specifying the characters that can be used, not the characters that cannot be used, per this SO post.

This is for a simple, non-production-level domain tester that looks like this:

domain:         /:\/\/(www\.)?([^\/:]+)/,

I'm just looking for some direction on how to think about this. The post mentions that allowing the myriad of Unicode characters is not a good thing...How do I formulate a plan to write this a tad better?

I am not concerned with the completeness of my domain checker ( it is just a prototype )...I am concerned with how to write reg-exes differently.

Cœur
  • instead of expressing a regex as characters you **can not** have, how do you express a regex as the characters that you **can** have... –  Aug 23 '12 at 22:21
  • `http://www.foo.com/some_path` –  Aug 23 '12 at 22:24

4 Answers

2

According to http://en.wikipedia.org/wiki/Domain_name#Internationalized_domain_names

the character set allowed in the Domain Name System is based on ASCII

and as per http://www.netregister.biz/faqit.htm#1

to name your domain you can use any letter, numbers between 0 and 9, and the symbol "-" [as long as the first character is not "-"]

and considering that your domain must end with .something, you are looking for

([a-zA-Z0-9][a-zA-Z0-9-]*\.)+[a-zA-Z0-9][a-zA-Z0-9-]*
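A minimal sketch of how that pattern behaves in JavaScript, assuming you anchor it with `^` and `$` so the whole string has to be a hostname. The helper name `isPlausibleHostname` and the sample hosts are only for illustration:

```javascript
// Pattern from the answer above, anchored so the entire input must match.
const HOSTNAME = /^([a-zA-Z0-9][a-zA-Z0-9-]*\.)+[a-zA-Z0-9][a-zA-Z0-9-]*$/;

// Hypothetical helper: true when the string looks like an ASCII hostname.
function isPlausibleHostname(host) {
  return HOSTNAME.test(host);
}

const results = {
  plain: isPlausibleHostname("www.foo.com"),        // every label starts with a letter or digit
  leadingDash: isPlausibleHostname("-foo.com"),     // a label may not start with "-"
  noDot: isPlausibleHostname("localhost"),          // at least one "label." group is required
  nonAscii: isPlausibleHostname("bücher.example"),  // "ü" is not in the allowed set
};

console.log(results);
```

Only the first of the four sample hosts passes. Because the set is a whitelist, anything you did not list is rejected without you having to think of it in advance.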
staafl
  • the dot is there to let you match the entire domain string up to the TLD, as in (www.dom1.dom2.website).com. As for the '-', it doesn't need to be escaped if it's the first or last character in the [] group ... – staafl Aug 24 '12 at 23:03
  • i see that i should repeat the initial group in order to ensure that none of the domain substrings start with a dash – staafl Aug 24 '12 at 23:04
1

This is a great question for Google, you know... but just to wet your beak: Matthew O'Riordan has written a regular expression that matches links with or without a protocol.

Here's a link to his blog post

But for future reference let me provide the regular expression from the post here as well:

/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+@)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+@)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-]*)?\??(?:[\-\+=&;%@\.\w]*)#?(?:[\.\!\/\\\w]*))?)/

And as nicely broken down by blog writer Matthew himself:

(
 ( # brackets covering match for protocol (optional) and domain
  ([A-Za-z]{3,9}:(?:\/\/)?)   # match protocol, allow in format http:// or mailto:
  (?:[\-;:&=\+\$,\w]+@)?   # allow something@ for email addresses
  [A-Za-z0-9\.\-]+   # anything looking at all like a domain, non-unicode domains
  | # or instead of above
  (?:www\.|[\-;:&=\+\$,\w]+@) # starting with something@ or www.
  [A-Za-z0-9\.\-]+   # anything looking at all like a domain
 )
 ( # brackets covering match for path, query string and anchor
  (?:\/[\+~%\/\.\w\-]*)  # allow optional /path
  ?\??(?:[\-\+=&;%@\.\w]*)  # allow optional query string starting with ? 
  #?(?:[\.\!\/\\\w]*) # allow optional anchor #anchor 
 )? # make URL suffix optional
)

What about your particular example

But in your case of matching URL domains, the positive counterpart of [^\/:] could simply be:

[-0-9a-zA-Z_.]

And that should match everything after // and before the first /. But what happens when your URLs don't end with a slash? What will you do in that case?

The regular expression above (the simplification) only matches one character, just like your negated character set does. So it simply replaces your negated set in the complete regex you're using.
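As a rough sketch of that substitution, here is the question's domain regex next to the same regex with the positive set dropped in. The variable names are made up for the example; the first URL is the one from the comments on the question, and the second is an assumed variant with no trailing slash:

```javascript
// Original pattern from the question: the host is "anything except / or :".
const negated = /:\/\/(www\.)?([^\/:]+)/;

// Same pattern with the negated set swapped for an explicit list of allowed characters.
const positive = /:\/\/(www\.)?([-0-9a-zA-Z_.]+)/;

const withPath = "http://www.foo.com/some_path";
const noSlash = "http://foo.com";

// Capture group 2 holds the host without the optional "www." prefix.
const hosts = [
  withPath.match(negated)[2],
  withPath.match(positive)[2],
  noSlash.match(negated)[2],
  noSlash.match(positive)[2],
];

console.log(hosts);
```

All four captures come out as `foo.com`. With no trailing slash, the `+` quantifier just runs to the end of the string, so neither version needs a terminating `/`.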

Robert Koritnik
  • but the point of my question was missed...I don't care about URL checking but more understanding how I can specify the inverse of a regex... –  Aug 23 '12 at 22:30
  • Well, provide the inverse... this particular regex doesn't have negative sets, only positive ones. And if you've had `[^012]` and you know you need numbers, then the negation of this would be `[3456789]`. The same is true in your case. If you don't allow slashes and colons, then provide those characters that you do allow, like `[-a-z0-9_.]` and probably some more. But for your testing these should likely suffice. – Robert Koritnik Aug 23 '12 at 22:34
  • @HiroProtagonist - If the point of your question is not really about URL checking why did you accept an answer that focused on that and ignored the "inverse regex" concept? – nnnnnn Aug 23 '12 at 22:35
  • He did both...looked up the character set I needed...and expressed it as the inverse of what I had...I was stuck on knowing whether the character set included Unicode characters in general, not just for domains, but in general, I didn't know what I didn't know until that answer was written...if someone would have said you need to know the character set first...this would have been answer enough...as common sense as this may seem...it did not seem that way earlier... –  Aug 23 '12 at 22:40
  • @HiroProtagonist - I'm glad you got an answer that helped you, but the regex in it is _not_ the inverse of the one in your question. Even ignoring obscure Unicode characters the inverse of `[^\/:]` would have to list other punctuation characters found on your keyboard... – nnnnnn Aug 23 '12 at 22:44
  • right...the question is phrased incorrectly - I didn't understand how Unicode characters had anything to do with regex checks in domains....so sorry I was unable to properly formulate the correct question....what i wanted say was....***another jslint error...what the f*** is wrong ....that is what I wanted to post...so I did the best I could...hopefully I wasn't the only one who learned something. –  Aug 23 '12 at 22:54
  • @HiroProtagonist: You did ok. If nnnnnnn read the whole of your question, then they'd know the question is not *just* about negative character set, but also about URL etc... You don't have to explain your answer acceptances. nnnnnn got his upvote anyway. But to some extent he's also correct. – Robert Koritnik Aug 23 '12 at 22:57
  • Thanks you guys. Sorry for being argumentative. It was just that the question title and the final sentence in italics made it sound like that was the focus and the domain-name thing was more an example, but again I really am glad you got the answer(s) you needed. (And don't upvote my question just to shut me up - if you think my answer is wrong you should _downvote_ it...) – nnnnnn Aug 23 '12 at 23:28
1

"I should be specifying the characters that can be used, not the characters that cannot be used"

No, that's nonsense, just JSLint being JSLint.

When you see [^\/:] in a regex it's immediately obvious what it is doing. If you tried to list all possible allowed characters the resulting regex would be horrendously difficult to read and it would be easy to accidentally forget to include some characters.

If you have a specific set of allowed characters then fine, list them. That's easier and more reliable than trying to list all possible invalid characters.

But if you have a specific set of invalid characters the [^] syntax is the appropriate way to do it.
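To make the trade-off concrete, here is a small sketch comparing the two styles on inputs that fall outside the expected shape. The sample URLs and variable names are assumptions for the example, and the positive set is the one suggested in another answer here:

```javascript
// Negated set from the question versus an explicit allow-list.
const negated = /:\/\/(www\.)?([^\/:]+)/;
const positive = /:\/\/(www\.)?([-0-9a-zA-Z_.]+)/;

// A query string with no path: "?" is neither "/" nor ":".
const queryUrl = "http://foo.com?x=1";
// A non-ASCII host name.
const unicodeUrl = "http://bücher.example/path";

const comparison = {
  queryNegated: queryUrl.match(negated)[2],     // swallows the query string
  queryPositive: queryUrl.match(positive)[2],   // stops at "?"
  unicodeNegated: unicodeUrl.match(negated)[2], // keeps the whole host
  unicodePositive: unicodeUrl.match(positive)[2], // stops at the first unlisted character
};

console.log(comparison);
```

The negated set accepts whatever you did not think to exclude (here the `?x=1`), while the unanchored allow-list silently truncates at the first character you did not think to include (here it captures only `b`). Which failure is preferable depends on whether you really know the full set of valid characters.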

nnnnnn
  • the choice to adhere to JSLint is not nonsense...it is a preference...either way is fine...given my post it should be obvious I want to adhere. –  Aug 23 '12 at 22:28
  • I'm saying this particular JSLint recommendation is nonsense, for the reasons I stated. I didn't say _all_ JSLint recommendations are nonsense, nor did I say that the idea of using JSLint is nonsense. For your domain name regex the allowed character list is manageable, but you stated _"I am not concerned with the completeness of my domain checker ( it is just a prototype )...I am concerned with how to write reg-exes differently."_, so... – nnnnnn Aug 23 '12 at 22:30
1

Here's a regex for characters you can have:

mycharactersarecool[^shouldnothavethesechars](oneoftwooptions|anotheroption)

Is this what you're talking about ?

Cosmin Atanasiu