The BNF grammar for URLs is given in RFC 1738:
http://www.w3.org/Addressing/rfc1738.txt
What I need to do is extract URLs from HTML text. I was wondering whether I can represent the BNF as one regular expression, built up rule by rule like this:
String alphadigit = "[a-zA-Z0-9]";
String domainlabel = alphadigit+"|"+alphadigit+"("+alphadigit+"|-)*?"+alphadigit;
// RFC 1738 toplabel rule, replaced below by a fixed list of top-level domains:
//String toplabel = alpha+"|"+alpha+"("+alphadigit+"|-)*?"+alphadigit;
String toplabel = "com|org|net|mil|edu|(co\\.[a-z]+)";
String hostname = "(("+domainlabel+")\\.)*("+toplabel+")";
String hostport = hostname;
String lowalpha = "([a-z])";
String hialpha = "([A-Z])";
String alpha = "("+lowalpha+"|"+hialpha+")";
String digit = "([0-9])";
// $ and . are regex metacharacters, so they are escaped to match literally
String safe = "(\\$|-|_|\\.|\\+)";
String extra = "(!|\\*|'|\\(|\\)|,)";
//String national = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`";
String punctuation = "(<|>|#|%|\")";
// ? is a regex quantifier, so it is escaped as well
String reserved = "(;|/|\\?|:|@|&|=)";
// hex = digit | "A".."F" | "a".."f" (an alternation, not a sequence)
String hex = "("+digit+"|[A-Fa-f]"+")";
String escape = "(%"+hex+hex+")";
String unreserved = "("+alpha+"|"+digit+"|"+safe+"|"+extra+")";
String uchar = "("+unreserved+"|"+escape+")";
String hsegment = "(("+uchar+"|;|:|@|&|=)*)";
String search = "(("+uchar+"|;|:|@|&|=)*)";
String hpath = hsegment+"(/"+hsegment+")*";
//String httpurl = "http://"+hostport+"(/"+hpath+"(\\?"+search+")?)?";
String httpurl = "http://"+hostport+"/"+hpath;
The final regex:
http://(([a-zA-Z0-9]|[a-zA-Z0-9]([a-zA-Z0-9]|-)*?[a-zA-Z0-9])\.)*(com|org|net|mil|edu|(co\.[a-z]+))/(((((([a-z])|([A-Z]))|([0-9])|(\$|-|_|\.|\+)|(!|\*|'|\(|\)|,))|(%(([0-9])|[A-Fa-f])(([0-9])|[A-Fa-f])))|;|:|@|&|=)*)(/(((((([a-z])|([A-Z]))|([0-9])|(\$|-|_|\.|\+)|(!|\*|'|\(|\)|,))|(%(([0-9])|[A-Fa-f])(([0-9])|[A-Fa-f])))|;|:|@|&|=)*))*
So I have turned the whole BNF into one big regular expression, which will be used with the java.util.regex classes to extract URLs from text. Is this the correct approach? If it is, why would we ever need to write a context-free grammar? What disadvantages does the regex approach have?
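The extraction step would look roughly like the sketch below. It is only an illustration under assumptions: the class name UrlExtractor, the sample HTML, and the condensed pattern (character classes standing in for the long alternations, with the path made optional) are made up for this example and are not the exact RFC 1738 grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlExtractor {
    // uchar = unreserved | escape, condensed into one character class plus %XX
    private static final String UCHAR =
            "(?:[a-zA-Z0-9$\\-_.+!*'(),]|%[0-9A-Fa-f]{2})";
    // hsegment = *( uchar | ";" | ":" | "@" | "&" | "=" )
    private static final String HSEGMENT = "(?:" + UCHAR + "|[;:@&=])*";
    // Dot-separated labels followed by one of a fixed list of top-level domains
    private static final String HOST =
            "(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\\.)*"
            + "(?:com|org|net|mil|edu|co\\.[a-z]+)";
    private static final Pattern HTTP_URL = Pattern.compile(
            "http://" + HOST + "(?:/" + HSEGMENT + "(?:/" + HSEGMENT + ")*)?");

    // Scans the text left to right and collects every substring the pattern matches.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = HTTP_URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.w3.org/Addressing/rfc1738.txt\">spec</a>"
                + " and <a href=\"http://example.com/a%20b\">page</a>";
        for (String url : extract(html)) {
            System.out.println(url);
        }
    }
}
```

Matcher.find() searches for the pattern anywhere in the input, which is what extraction needs, whereas matches() would require the whole input to be a URL.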
Besides, a grammar-based parser for, say, a programming language uses the grammar to check that the code follows the rules, and reports errors if it does not. The grammar also gives us a syntax tree, which is then used to evaluate the expression. For URLs we don't evaluate anything; we just need to extract the URLs from the surrounding text.
I ask because I was previously trying to parse email addresses. After an exhaustive search for regular expressions, none turned out to be 100% accurate, and I came across comments about the limitations of regex in matching the exact BNF for email addresses in the RFC, suggesting that a grammar (instead of a regex) might be required. Hence my question about URLs.
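As far as I understand it, the limitation usually pointed to for email is that the RFC 822 grammar lets comments nest, as in (a (b) c), and a rule that refers to itself like that cannot be flattened into a single classical regular expression the way the URL rules above can. A minimal sketch of what a hand-written recursive check for that one rule might look like is below. The class and method names are hypothetical, and the comment text is simplified to "any character other than a parenthesis".

```java
class NestedComment {
    // Returns the index just past the balanced comment that starts at
    // text[start], or -1 if there is no balanced comment at that position.
    static int skipComment(String text, int start) {
        if (start >= text.length() || text.charAt(start) != '(') {
            return -1;
        }
        int i = start + 1;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == ')') {
                return i + 1;              // this comment is closed
            }
            if (c == '(') {
                i = skipComment(text, i);  // recursion mirrors the self-referencing rule
                if (i < 0) {
                    return -1;             // inner comment was never closed
                }
            } else {
                i++;                       // ordinary comment text
            }
        }
        return -1;                         // ran out of input before ')'
    }

    // True only if the whole string is exactly one balanced comment.
    static boolean isComment(String s) {
        return skipComment(s, 0) == s.length();
    }
}
```

The URL rules used in the question never refer back to themselves, which is why they could be substituted into one another until a single expression was left.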
Thanks