Parsing HTML with Regular Expressions?

Question

I've been trying to gather information using regular expressions:

    Pattern hp = Pattern.compile("<small>.....</small>");
    Matcher mp = hp.matcher(code);
    while (mp.find()) {
        String grupoHORARIO = mp.group();
        System.out.println(grupoHORARIO);
    }

When I run the program, instead of showing me:

RESULT1
RESULT2
RESULT3

It shows this:

<small>RESULT1</small>
<small>RESULT2</small>

As you see, it shows the opening and closing "small" tags before and after the word I am looking for. What I need is just the word, without the "small" tags around it.

[Don't use regular expressions to parse HTML; use an HTML or XML parser.](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) — VGR, Sep 03 '13 at 00:35
[You forgot this link.](http://stackoverflow.com/a/1732454/2030691) — Xynariz, Sep 03 '13 at 01:54
Canonical question: *[RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/)* — Peter Mortensen, Nov 11 '14 at 00:07

score 0 · Accepted Answer · edited May 23 '17 at 11:57


USING REGEX TO PARSE HTML IS BAD.

Again, using RegEx to parse HTML is bad.

That being said, in answer to your question: the problem is how you're using the regular expression. The only code of yours I would change is the pattern passed to `Pattern.compile()`. As currently written, it only matches when there is `<small>`, then exactly 5 characters, then `</small>`. That match includes the start and end tags.

If what you want is to match only the middle part, you can try using regex lookaround. The way I did it is: `(?<=<small>).*(?=</small>)`. Broken into parts:

`.*` - Any number of characters.

`.*(?=</small>)` - Any number of characters that are followed by `</small>`.

`(?<=<small>).*(?=</small>)` - Any number of characters that are preceded by `<small>` and followed by `</small>`.

If you don't want it to match any character, replace the `.*` with whatever you do want to find (for example, `.....` or `.{5}` will match exactly 5 characters).
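As a minimal sketch of how the lookaround pattern could be used in the question's loop: the class and method names below are hypothetical, and the reluctant `.*?` is swapped in for `.*` so that two `<small>` elements on the same line come out as two separate matches (a greedy `.*` would run from the first opening tag to the last closing tag on that line). A capturing-group variant is included as an alternative that avoids lookaround altogether.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class SmallTagExtractor {

    // Lookaround version: the tags are asserted but not consumed,
    // so group() returns only the text between them.
    private static final Pattern LOOKAROUND =
            Pattern.compile("(?<=<small>).*?(?=</small>)");

    // Capturing-group version: the tags are consumed by the match,
    // but group(1) holds only the text inside them.
    private static final Pattern GROUPED =
            Pattern.compile("<small>(.*?)</small>");

    static List<String> extractWithLookaround(String code) {
        List<String> results = new ArrayList<>();
        Matcher m = LOOKAROUND.matcher(code);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    static List<String> extractWithGroup(String code) {
        List<String> results = new ArrayList<>();
        Matcher m = GROUPED.matcher(code);
        while (m.find()) {
            results.add(m.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        String code = "<small>RESULT1</small> <small>RESULT2</small>\n"
                + "<small>RESULT3</small>";
        System.out.println(extractWithLookaround(code)); // [RESULT1, RESULT2, RESULT3]
        System.out.println(extractWithGroup(code));      // [RESULT1, RESULT2, RESULT3]
    }
}
```

Both variants still assume the tags appear exactly as `<small>` and `</small>`, with no attributes, different casing, or nesting, which is one reason the warning at the top of this answer still applies.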

answered Sep 03 '13 at 02:23 by Xynariz · edited May 23 '17 at 11:57 by Community

That was really helpful! And what can I do when I just want to get any number of characters that are preceded by the code below? When I try to do that, it throws the following: "Invalid regular expression: Look-behind group does not have an obvious maximum length." – Seba Paz Sep 04 '13 at 00:59
The regular expression that throws the exception is this: `(?<=.*` – Seba Paz Sep 04 '13 at 01:00
In your example, you forgot the closing paren. It should be `(?<=).*` – Xynariz Sep 04 '13 at 01:17
-1 for `USING REGEX TO PARSE HTML IS BAD`. +1 for a good answer despite that. – Adrian Pronk Sep 04 '13 at 01:31
I have read on so many websites that using regex to parse HTML is not right. But that's the only way I know for now. – Seba Paz Sep 04 '13 at 01:43
@SebaPaz: Many people on SO want you to use an XML parser whenever you have to process any text that resembles XML. See [java XML parsing](http://www.google.co.nz/search?q=java+XML+parsing) (Google search) – Adrian Pronk Sep 04 '13 at 01:51
@AdrianPronk is right, XML parsers are the way to go. Hence the first few lines of my answer. Shame, though, that people -1 a completely useful answer, and +1 "Go Away, your approach is wrong" answers. – Xynariz Sep 04 '13 at 02:18
Yes, it is. I have not enough reputation yet to +1 your answer, but it was totally helpful to me. – Seba Paz Sep 04 '13 at 02:37
Hopefully people seeing this answer will think two things. 1: Regex is indeed powerful, and can do a lot of parsing successfully. 2: Don't use it in HTML. – Xynariz Sep 05 '13 at 21:15
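
For comparison, here is a minimal sketch of the parser route suggested in the comments, using the DOM parser that ships with the JDK. It assumes the input is well-formed XML with a single root element; the `<root>` wrapper in the sample and the class and method names are made up for illustration. Real-world HTML is often not well-formed XML, in which case a dedicated HTML parser would be needed instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class SmallTagXmlExtractor {

    // Returns the text content of every <small> element in the document.
    // Assumes the input is well-formed XML; otherwise parsing fails.
    static List<String> extractSmallText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("small");
            List<String> results = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                results.add(nodes.item(i).getTextContent());
            }
            return results;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><small>RESULT1</small>"
                + "<small>RESULT2</small><small>RESULT3</small></root>";
        System.out.println(extractSmallText(xml)); // [RESULT1, RESULT2, RESULT3]
    }
}
```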
