Java Regex html parser

Question

Possible Duplicate:
java regex quantifiers

I am learning some regex right now, and I'm having trouble with this problem:

So I have a string like TAG1 sometext TAG2 some text TAG3 someText

What I need are the substrings that each start at a tag and run up to the next one, something like:

TAG1 sometext
TAG2 some text
TAG3 someText

So I wrote this regex:

Pattern pattern = Pattern.compile("TAG\\d.*TAG\\d");
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println(matcher.group());
}

But the output is

TAG1 sometext TAG2 some text TAG3 someText

The way I understand it, the dot matches any character and the star quantifies that to zero or more. So I believe my regex means: TAG with a digit, then some other stuff, then TAG and a digit.

I am also realizing while I write this that I do not want all subsets of TAG# text TAG# combinations. For example, I do not want TAG# text TAG# text TAG#.

Can someone augment my understanding of regex, please?

Thanks

EDIT ---

I am not writing a full-blown HTML parser in regex. This is an HTML parsing project and I am using Jsoup for the biggest part of it. This regex is just a hack to get some metadata about the HTML so that I can pass the HTML to Jsoup in one form or another.

Regex is a double-edged sword; use an HTML parser instead, like http://jtidy.sourceforge.net/ – Nishant, Feb 09 '12 at 05:42
[You shouldn't try to parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Bohemian, Feb 09 '12 at 06:11
This is the answer: http://stackoverflow.com/a/9206766/720003 – b3bop, Feb 09 '12 at 22:16

score 1 · Answer 1 · answered Feb 09 '12 at 05:38


There is no group in your expression. Split it into groups using parentheses, like "(TAG\d)(.*)(TAG\d)". I am also a novice with regex, so you might need to play with your regex, but the parentheses are the bare minimum.
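To make the effect of the parentheses concrete, here is a minimal sketch (the class name `GroupDemo` and the helper `firstMatchGroups` are made up for illustration). It suggests that grouping only exposes parts of the match through `Matcher.group(int)`; it does not change how far the greedy `.*` reaches, so the middle group still runs from the first tag to the last one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Hypothetical helper: returns the three captured groups of the first
    // match, or an empty array when the pattern does not match at all.
    static String[] firstMatchGroups(String input) {
        Matcher matcher = Pattern.compile("(TAG\\d)(.*)(TAG\\d)").matcher(input);
        if (!matcher.find()) {
            return new String[0];
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        String[] groups = firstMatchGroups("TAG1 sometext TAG2 some text TAG3 someText");
        // The greedy .* backtracks only as far as the LAST tag, so group 2
        // still contains TAG2 and its text.
        for (String group : groups) {
            System.out.println("[" + group + "]");
        }
    }
}
```

With the sample string from the question, the groups come out as `TAG1`, ` sometext TAG2 some text ` and `TAG3`, which matches the asker's report that adding parentheses alone gave the same result.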


manocha_ak


I tried this, but it had the same result. Just for learning's sake, how would one match a '(' then? Thanks for the answer anyway. – b3bop Feb 09 '12 at 05:41
Escape it with a backslash, like this: \( (similarly \\, \n, etc.). But it didn't work? Must be a Java-specific regex engine anomaly... I mean, there might be another way to do that. Do you want to do this exercise one time only, or build an application using regex? If the former, I might suggest better regex tools. – manocha_ak Feb 09 '12 at 05:45
You'll want to add a group if you don't want to see the TAG part in the output: `TAG\d(.*)`, but see my answer about limiting what you're matching. Any time you want to match a special character as a literal, just add an escape: `\\(` will match a left parenthesis. – Dmitri Feb 09 '12 at 05:47
And on top of all this, writing a full-blown HTML parser in regex is a very big and very tough exercise. Regex limitations will hit you. – manocha_ak Feb 09 '12 at 05:50

score 1 · Answer 2 · answered Feb 09 '12 at 05:40


Regex quantifiers are greedy by default - they will match as much as possible, so .* matches all the following TAG# sequences. Explanation of how to add appropriate modifiers here.

You may also find lookahead assertions to be useful.
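As a rough sketch of how those two ideas might combine for the string in the question: the reluctant quantifier `.*?` matches as little as possible, and the lookahead `(?=TAG\d|$)` requires the next tag (or the end of the input) to follow without consuming it. The class name `TagSplitter` and the method `split` are invented for this example, and it assumes the text contains no line breaks, since `.` does not match them by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagSplitter {
    // TAG\d      a tag marker followed by one digit
    // .*?        as few characters as possible (reluctant quantifier)
    // (?=...)    lookahead: the next tag or end of input must follow,
    //            but is not consumed, so the next find() can start there
    private static final Pattern TAG_CHUNK = Pattern.compile("TAG\\d.*?(?=TAG\\d|$)");

    // Hypothetical helper: returns each "TAGn text" chunk, trimmed.
    static List<String> split(String input) {
        List<String> chunks = new ArrayList<>();
        Matcher matcher = TAG_CHUNK.matcher(input);
        while (matcher.find()) {
            chunks.add(matcher.group().trim());
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (String chunk : split("TAG1 sometext TAG2 some text TAG3 someText")) {
            System.out.println(chunk);
        }
    }
}
```

Run against the sample input, this prints `TAG1 sometext`, `TAG2 some text` and `TAG3 someText` on separate lines. Because the lookahead leaves the following tag unconsumed, each tag can start its own match, which a pattern ending in a plain `TAG\\d` would not allow.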

Also, why is this tagged HTML? Doesn't seem like that's what you're parsing.


Dmitri


Ah, you are right, it shouldn't be tagged as such. It's part of an HTML parsing project though, and my brain was on autopilot... – b3bop Feb 09 '12 at 05:44
Just to echo @Nishant then, parsing HTML with regex is a hideous undertaking and should only be done for educational purposes. I've had good experience with the [Jericho](http://jericho.htmlparser.net/docs/index.html) parser. – Dmitri Feb 09 '12 at 05:51
