StringTokenizer st = new StringTokenizer(remaining, "\t\n\r\"'>#");

String strLink = st.nextToken();

The string *remaining* can hold one of the following inputs:

  1. "http://somegreatsite.com">Link Name</a>is a link to another nifty site<H1>This is a Header</H1><H2>This is a Medium Header</H2>Send me mail at <a href="mailto:support@yourcompany.com">support@yourcompany.com</a>.<P> This is a new paragraph!<P> <B>This is a new paragraph!</B><BR> <B><I>This is a new sentence without a paragraph break, in bold italics.</I></B><HR></BODY></HTML>

  2. "mailto:support@yourcompany.com">support@yourcompany.com</a>.<P> This is a new paragraph!<P> <B>This is a new paragraph!</B><BR> <B><I>This is a new sentence without a paragraph break, in bold italics.</I></B><HR></BODY></HTML>

I know that the StringTokenizer constructor will split the string *remaining* into tokens using the regular expression. But I am unable to understand the regular expression used here.

Depending on the value of *remaining*, strLink ends up with one of the following values:

  1. http://somegreatsite.com
  2. mailto:support@yourcompany.com

Please help me understand the regular expression used in the code above.

Jim Garrison
  • Each of the characters in the argument following `remaining` is treated as a delimiter, and every time one of those delimiters is encountered the input string is split into a token. – Hunter McMillen Feb 26 '12 at 07:34
  • `"\t\n\r\"'>#"` is not a regular expression. It's just a simple list of chars. – Piotr Praszmo Feb 26 '12 at 07:36
  • Reading the javadoc really helps when you're a Java developer: http://docs.oracle.com/javase/6/docs/api/java/util/StringTokenizer.html#StringTokenizer%28java.lang.String,%20java.lang.String%29 – JB Nizet Feb 26 '12 at 08:02
  • Obligatory link: You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Sam Greenhalgh Feb 26 '12 at 12:04

1 Answer


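The characters \t\n\r\"'># are not a regular expression but a list of delimiters: the tokenizer splits the input wherever any one of them occurs. You can look up the meaning of the escaped characters in the documentation of the Pattern class, for example.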

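\t - the tab character
\n - the newline (line feed) character
\r - the carriage-return character
\" - just a double quote, escaped so that it can appear inside the string literal
', >, # - the single quote, greater-than sign, and hash character, taken literally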
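To see the delimiter list in action, here is a minimal sketch that runs the tokenizer from the question on shortened versions of the two inputs. The class name DelimiterDemo and the helper firstToken are made up for illustration and are not part of the original code. The leading double quote is itself a delimiter, so it is skipped, and the first token ends at the next delimiter, which is the closing double quote.

```java
import java.util.StringTokenizer;

// Hypothetical demo class; the name is an assumption, not from the question.
class DelimiterDemo {

    // Returns the first token of `remaining`, using the delimiter list from the question.
    // Each character of the second argument is a separate delimiter, not a regex.
    static String firstToken(String remaining) {
        StringTokenizer st = new StringTokenizer(remaining, "\t\n\r\"'>#");
        return st.nextToken();
    }

    public static void main(String[] args) {
        // Shortened versions of the two inputs from the question.
        String first = "\"http://somegreatsite.com\">Link Name</a>is a link to another nifty site";
        String second = "\"mailto:support@yourcompany.com\">support@yourcompany.com</a>.";

        System.out.println(firstToken(first));   // prints http://somegreatsite.com
        System.out.println(firstToken(second));  // prints mailto:support@yourcompany.com
    }
}
```

Note that the space character is not in the delimiter list, so a later token such as "Link Name</a" would keep its space.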
Andrew Logvinov