How to use Regular Expression in StringTokenizer

Question

StringTokenizer st = new StringTokenizer(remaining, "\t\n\r\"'>#");

String strLink = st.nextToken();

The string *remaining* can hold either of the following inputs:

"http://somegreatsite.com">Link Name</a>is a link to another nifty site<H1>This is a Header</H1><H2>This is a Medium Header</H2>Send me mail at <a href="mailto:support@yourcompany.com">support@yourcompany.com</a>.<P> This is a new paragraph!<P> <B>This is a new paragraph!</B><BR> <B><I>This is a new sentence without a paragraph break, in bold italics.</I></B><HR></BODY></HTML>
"mailto:support@yourcompany.com">support@yourcompany.com</a>.<P> This is a new paragraph!<P> <B>This is a new paragraph!</B><BR> <B><I>This is a new sentence without a paragraph break, in bold italics.</I></B><HR></BODY></HTML>

I know that the StringTokenizer constructor will split the string *remaining* into tokens using the regular expression, but I am unable to understand the regular expression used here.

strLink will have one of the following values, depending on the value of the string *remaining*:

1. http://somegreatsite.com
2. mailto:support@yourcompany.com

Please help me understand the regular expression used in the code above.

Each of the characters in the argument following `remaining` is treated as a delimiter, and every time one of those delimiters is encountered the input string is split into a new token. — Hunter McMillen, Feb 26 '12 at 07:34
`"\t\n\r\"'>#"` is not a regular expression. It's just a simple list of chars. — Piotr Praszmo, Feb 26 '12 at 07:36
Reading the javadoc really helps when you're a Java developer: http://docs.oracle.com/javase/6/docs/api/java/util/StringTokenizer.html#StringTokenizer%28java.lang.String,%20java.lang.String%29 — JB Nizet, Feb 26 '12 at 08:02
Obligatory link: You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Sam Greenhalgh, Feb 26 '12 at 12:04

score 3 · Answer 1 · answered Feb 26 '12 at 08:01

The characters \t\n\r\"'># are not a regular expression; they are delimiters. You can look up the meaning of the special characters in the documentation of the Pattern class, for example.

\t - The tab character
\n - The newline (line feed) character
\r - The carriage-return character
\" - this is just a double quote
', >, # - other symbols
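To make the difference concrete, here is a minimal sketch, assuming the first input from the question (shortened). The class name `DelimiterDemo` and the two helper methods are hypothetical names chosen only for illustration. The first method uses the delimiter list exactly as in the question; the second shows one way to get a similar result with an actual regular expression, a character class passed to `String.split`.

```java
import java.util.NoSuchElementException;
import java.util.StringTokenizer;

// Hypothetical demo class; not part of the original code.
class DelimiterDemo {

    // Each character of the second argument is an individual delimiter.
    // The string is NOT interpreted as a regular expression.
    static String firstToken(String remaining) {
        StringTokenizer st = new StringTokenizer(remaining, "\t\n\r\"'>#");
        return st.nextToken();
    }

    // A regex-based variant: a character class holding the same characters.
    // split() can return an empty leading element, so empty parts are skipped.
    static String firstTokenRegex(String remaining) {
        for (String part : remaining.split("[\t\n\r\"'>#]+")) {
            if (!part.isEmpty()) {
                return part;
            }
        }
        throw new NoSuchElementException("no token found");
    }

    public static void main(String[] args) {
        String remaining = "\"http://somegreatsite.com\">Link Name</a>is a link";
        System.out.println(firstToken(remaining));      // prints http://somegreatsite.com
        System.out.println(firstTokenRegex(remaining)); // prints http://somegreatsite.com
    }
}
```

The tokenizer skips the leading double quote because it is a delimiter, then reads up to the next delimiter (the closing double quote), which is why strLink ends up holding only the URL.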
