
The BNF form of a URL is given in RFC 1738:

http://www.w3.org/Addressing/rfc1738.txt

What I need to do is extract the URLs from HTML text. Now I was wondering whether I can represent it like this:

    String alphadigit  = "[a-zA-Z0-9]";
    String domainlabel = alphadigit+"|"+alphadigit+"("+alphadigit+"|-)*?"+alphadigit;
    //String toplabel  = alpha+"|"+alpha+"("+alphadigit+"|-)*?"+alphadigit;
    String toplabel    = "com|org|net|mil|edu|(co\\.[a-z]+)";
    String hostname    = "(("+domainlabel+")\\.)*("+toplabel+")";
    String hostport    = hostname;

    String lowalpha    = "([a-z])";
    String hialpha     = "([A-Z])";
    String alpha       = "("+lowalpha+"|"+hialpha+")";
    String digit       = "([0-9])";
    String safe        = "(\\$|-|_|\\.|\\+)";   // $, . and + need escaping inside a regex
    String extra       = "(!|\\*|'|\\(|\\)|,)";
    //String national  = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`";
    String punctuation = "(<|>|#|%|\")";
    String reserved    = "(;|/|\\?|:|@|&|=)";   // ? needs escaping as well
    String hex         = "([0-9A-Fa-f])";       // hex = digit | "a".."f" | "A".."F"
    String escape      = "(%"+hex+hex+")";
    String unreserved  = "("+alpha+"|"+digit+"|"+safe+"|"+extra+")";
    String uchar       = "("+unreserved+"|"+escape+")";
    String hsegment    = "(("+uchar+"|;|:|@|&|=)*)";
    String search      = "(("+uchar+"|;|:|@|&|=)*)";
    String hpath       = hsegment+"(/"+hsegment+")*";
    //String httpurl   = "http://"+hostport+"(/"+hpath+"(\\?"+search+")?)?";
    String httpurl     = "http://"+hostport+"/"+hpath;

The final regex:

http://(([a-zA-Z0-9]|[a-zA-Z0-9]([a-zA-Z0-9]|-)*?[a-zA-Z0-9])\.)*(com|org|net|mil|edu|(co\.[a-z]+))/(((((([a-z])|([A-Z]))|([0-9])|(\$|-|_|\.|\+)|(!|\*|'|\(|\)|,))|(%([0-9A-Fa-f])([0-9A-Fa-f])))|;|:|@|&|=)*)(/(((((([a-z])|([A-Z]))|([0-9])|(\$|-|_|\.|\+)|(!|\*|'|\(|\)|,))|(%([0-9A-Fa-f])([0-9A-Fa-f])))|;|:|@|&|=)*))*

So you can see I converted the whole BNF into one big regular expression, which will be used with java.util.regex methods to extract the URLs out of the text. Now, is this the correct approach? If it is, then why would we ever need to write a context-free grammar? What disadvantages does the regex approach have?
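For context, this is roughly how I intend to apply the pattern (a minimal sketch; the shortened regex and the sample text below are just stand-ins for illustration):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class UrlExtractor {
        public static void main(String[] args) {
            // shortened stand-in for the httpurl regex assembled above
            String httpurl = "http://(([a-zA-Z0-9]|[a-zA-Z0-9]([a-zA-Z0-9]|-)*?[a-zA-Z0-9])\\.)*"
                           + "(com|org|net|mil|edu|(co\\.[a-z]+))(/\\S*)?";
            String text = "Visit http://www.example.com/page for details.";

            Matcher matcher = Pattern.compile(httpurl).matcher(text);
            while (matcher.find()) {                 // scan for each non-overlapping match
                System.out.println(matcher.group()); // print the matched URL
            }
        }
    }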

Besides, with a grammar-based parser, say for a programming language, the grammar is used to validate that the code follows the grammar rules and to show error messages otherwise. Also, from the grammar we get a syntax tree, which is used to evaluate the expression. For the URL case we don't evaluate anything; we just need to extract the URLs out of the rest of the text.

This question came up because I was previously trying to parse email addresses. After exhaustively searching for regular expressions, none of them turned out to be 100% accurate, and some comments pointed out the limitations of regexes in matching the exact BNF form of email addresses given in the RFC, so a grammar (instead of a regex) might be required. Hence I have the same question for URLs.

Thanks

Darkzaelus
user285825
  • just to make sure, you are aware of non-ASCII urls, don't you? – shabunc Jun 04 '13 at 07:53
  • Why do you wish to [reinvent the wheel](http://stackoverflow.com/a/285880/878469)? – predi Jun 04 '13 at 09:13
  • Are you sure this is a good approach? I would be handling thousands and thousands of web pages, and if I did this URL-checking thing for each of them it would generate exceptions unnecessarily. Besides, the URLs in my case might not even contain the http part, so it is sure to give a MalformedURLException. In fact, if they had all contained http, I would not have the problem in the first place. – user285825 Jun 05 '13 at 06:09
  • Sorry, I think I made a mistake; it should be (http://|https://)? – user285825 Jun 05 '13 at 06:18
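For reference, the exception-based validation discussed in the comments above (using java.net.URL) would look roughly like this; the example strings are my own:

    import java.net.MalformedURLException;
    import java.net.URL;

    public class UrlValidator {
        static boolean isValidUrl(String candidate) {
            try {
                new URL(candidate);   // throws for strings without a protocol, e.g. "example.com"
                return true;
            } catch (MalformedURLException e) {
                return false;
            }
        }

        public static void main(String[] args) {
            System.out.println(isValidUrl("http://example.com")); // true
            System.out.println(isValidUrl("example.com"));        // false: no protocol
        }
    }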

2 Answers


Well, I think your issue could be solved more easily using some heuristics about what an http link looks like in free text. This could also run much faster than such a complicated regexp, especially if we are talking about large texts:

  1. An http link (URL) starts with the unique prefix http://
  2. From start to end, a URL doesn't contain certain characters (whitespace, for example). When you come across such a character, you have found the end of the URL (see the sketch below).
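A minimal sketch of this heuristic (my own illustration, not code from the answer): scan for http:// and read until the next whitespace character.

    import java.util.ArrayList;
    import java.util.List;

    public class HeuristicUrlExtractor {
        static List<String> extractUrls(String text) {
            List<String> urls = new ArrayList<>();
            int start = text.indexOf("http://");
            while (start != -1) {
                int end = start;
                // the URL ends at the first whitespace character (or at end of input)
                while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
                    end++;
                }
                urls.add(text.substring(start, end));
                start = text.indexOf("http://", end);
            }
            return urls;
        }

        public static void main(String[] args) {
            System.out.println(extractUrls("see http://example.com/a and http://example.org"));
        }
    }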
Andremoniy
  • Yeah, that is the problem. I am going to parse craigslist web pages, and the text is manually entered without regard to any form. So some URLs may look like this: http://blahblah.com.Location: atlanta. Hence I have to strictly adhere to the regular expression to exclude the possibility of such a thing happening (i.e. I cannot rely on it 100%, because some deranged user might enter a URL without regard to proper punctuation and structure, expecting the reader of the post to figure it out on his own). Hence I am not 100% confident about any heuristics. – user285825 Jun 04 '13 at 07:55
  • Well, any `URL` is free to contain the word `Location` at the end, because the rules for naming resources allow such names for local resources. – Andremoniy Jun 04 '13 at 08:49
  • In plain-text ads you will see wrong slashes ("http:\\example.com"), quotes ("http://example - 1.com"), etc. Apparently modern browsers and email clients make a lot of stuff clickable. – Larytet Apr 08 '20 at 10:47

If the URL you are extracting is within tags (such as the href attribute of an anchor tag), then I'd recommend using JSoup to parse and inspect the HTML.

http://jsoup.org/

Within the body of the text, I'm certain a simpler regex approach is possible, perhaps matching on the protocol (http://).
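For the tag case, a minimal JSoup sketch (my own illustration, not part of the original answer) would be:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupLinkExtractor {
        public static void main(String[] args) {
            String html = "<p>See <a href=\"http://example.com/page\">this page</a>.</p>";
            Document doc = Jsoup.parse(html);
            // select every anchor element that carries an href attribute
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("href"));
            }
        }
    }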

Simon Curd