RegEx to extract text between a HTML tag

Question

I'm looking for a regular expression that extracts the text between HTML tags of different types.

For example:

<span>Span 1</span> - O/p: Span 1

<div onclick="callMe()">Span 2</div> - O/p: Span 2

<a href="#">HyperText</a> - O/p: HyperText

I found this pattern, <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>, from here, but it is not working.

I would like to refer you to the legendary top answer of this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Philipp, Mar 28 '13 at 15:20
@MikeM By "not working" I mean it is not giving the desired result. It fails to extract the content and instead shows the entire HTML tag. – Sriram, Mar 28 '13 at 15:22
@Philipp I had gone through that but couldn't find the exact answer. – Sriram, Mar 28 '13 at 15:26
@MikeM Yes. It is in the second group. I'm using it in Java like this: `test.replaceAll(patt, "$2")` – Sriram, Mar 28 '13 at 15:28
@Sriram The exact answer is: **Don't use regular expressions to parse HTML**, in case that wasn't obvious enough. – Philipp, Mar 28 '13 at 15:28
@MikeM I already provided it: `"test".replaceAll(patt, "$2")`. My intention with the above expression is to get **test** as the output, but it shows the entire text as it is. – Sriram, Mar 28 '13 at 15:39
@MikeM I guess I should be more clear: `String outPut = "test".replaceAll("<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)\1>", "$2"); System.out.println(outPut);` – Sriram, Mar 28 '13 at 15:49
@MikeM You are brilliant. It worked like a charm. Thanks! Is it possible to check this recursively? I mean for nested tags like `sriram` – Sriram, Mar 28 '13 at 16:05
Your best bet is to use an HTML parser. Something like http://jsoup.org/. – Mahesh Guruswamy, Mar 28 '13 at 15:21

MikeM · Accepted Answer · 2013-03-28T16:05:51.550

10

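Your comment shows that you have neglected to escape the backslashes in your regex string. In a Java string literal, an unescaped \b is a backspace character and \1 is an octal escape, so the regex engine never sees the word boundary or the backreference.

If you also want to match lowercase letters, add a-z to the character classes or use Pattern.CASE_INSENSITIVE (or add (?i) to the beginning of the regex):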

"<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>"

If the tag contents may contain newlines, then use Pattern.DOTALL or add (?s) to the beginning of the regex to turn on dotall/singleline mode.
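As a minimal sketch of how the points above fit together, the following puts the escaped pattern and both flags into a small program. The class name `TagTextExample` and the helper `innerText` are made up for illustration, and the third input (uppercase tag, newline in the content) is an assumed test case rather than one from the question.

```java
import java.util.regex.Pattern;

public class TagTextExample {
    // Same pattern as in the answer. Every regex backslash is doubled inside
    // the Java string literal. (?is) enables case-insensitive and dotall mode,
    // so a-z alone is enough in the character classes.
    static final Pattern TAG = Pattern.compile(
        "(?is)<([a-z][a-z0-9]*)\\b[^>]*>(.*?)</\\1>");

    // Hypothetical helper: replaces each matched element with the text
    // captured in group 2, i.e. the content between the tags.
    static String innerText(String html) {
        return TAG.matcher(html).replaceAll("$2");
    }

    public static void main(String[] args) {
        System.out.println(innerText("<span>Span 1</span>"));
        System.out.println(innerText("<div onclick=\"callMe()\">Span 2</div>"));
        // Assumed extra case: uppercase tag name and a newline in the content.
        System.out.println(innerText("<A href=\"#\">Hyper\nText</A>"));
    }
}
```

This only handles one level of tags per call and inherits every limitation of matching HTML with a regex, so treat it as a demonstration of the escaping and flags rather than a general extractor.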

edited Mar 28 '13 at 16:05

answered Mar 28 '13 at 15:21

MikeM


Thanks for this. Yes, I forgot to escape the backslashes in the expression. I'm looking for one more option in that expression: something that recursively checks the HTML tags and ultimately gets the text between them. **Ex:** `test` I hope I'm clearer this time. – Sriram Mar 28 '13 at 16:23
@Sriram. To get the inner tags you would have to use the above regex in a loop, but I think you would be better off asking a new question for that. – MikeM Mar 28 '13 at 16:36
I am unable to retrieve the content between the tags below:
Ajay has no watch
So wait for a while to get time
Please provide a solution. – Kunal Varpe Mar 17 '17 at 06:22
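The "use the regex in a loop" suggestion from the comments could look roughly like the sketch below: each pass strips one layer of tags, and the loop stops once a pass changes nothing. The class `NestedTagText`, the helper `stripNested`, and the sample inputs are assumptions made for illustration; the nested markup from the original comment did not survive, so the inputs here are invented.

```java
import java.util.regex.Pattern;

public class NestedTagText {
    // The accepted answer's pattern with case-insensitive and dotall flags.
    static final Pattern TAG = Pattern.compile(
        "(?is)<([a-z][a-z0-9]*)\\b[^>]*>(.*?)</\\1>");

    // Hypothetical helper: repeatedly replaces each element with its content
    // until a pass leaves the string unchanged.
    static String stripNested(String html) {
        String previous;
        String current = html;
        do {
            previous = current;
            current = TAG.matcher(previous).replaceAll("$2");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        // Assumed sample inputs, not taken from the thread.
        System.out.println(stripNested("<div><b>sriram</b></div>"));
        System.out.println(stripNested("<p>Hello <i>big</i> world</p>"));
    }
}
```

The lazy `(.*?)` stops at the first matching closing tag, so deeply nested or malformed markup can still confuse this approach; an HTML parser remains the more robust choice.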

score 1 · Answer 2 · answered Mar 28 '13 at 16:13


This should suit your needs:

<([a-zA-Z]+).*?>(.*?)</\\1>

The first group contains the tag name, the second one the text in between.
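To make the two groups concrete, here is a small sketch that walks over several sibling tags with `Matcher.find()` and records both groups for each match. The class `TagGroups` and the helper `tagsAndText` are hypothetical names, and the input simply concatenates two of the examples from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagGroups {
    // The pattern from this answer, written as a Java string literal.
    static final Pattern TAG = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");

    // Hypothetical helper: returns "tagName=text" for every match found.
    static List<String> tagsAndText(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            // group(1) is the tag name, group(2) the text between the tags.
            result.add(m.group(1) + "=" + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(
            tagsAndText("<span>Span 1</span><a href=\"#\">HyperText</a>"));
    }
}
```

Note that `[a-zA-Z]+` does not cover tag names containing digits such as `h1`, and nested tags are not unwrapped by a single pass.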

answered Mar 28 '13 at 16:13

sp00m


If there are multiple tags, the regular expression does not work. – Kunal Varpe Mar 17 '17 at 06:25

score 1 · Answer 3 · answered Jun 21 '22 at 18:07

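import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 is the tag name, group 2 the text between the tags.
// [^>]* and the lazy (.+?) keep a match from running across several tags.
Matcher matcher = Pattern.compile("<([a-zA-Z]+)[^>]*>(.+?)</\\1>")
    .matcher("<a href=\"#\">HyperText</a>");

while (matcher.find())
{
    String matched = matcher.group(2);

    // start() and end() are the offsets of the whole match, tags included.
    System.out.println(matched + " found at\n"
        + "start at :- " + matcher.start() + "\n"
        + "end at :- " + matcher.end() + "\n");
}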

score -1 · Answer 4 · answered Mar 28 '13 at 15:24


A very specific way:

(<span>|<a href="#">|<div onclick="callMe\(\)">)(.*)(</span>|</a>|</div>)

but yeah, this will only work for those 3 examples. You'll need to use an HTML parser.

answered Mar 28 '13 at 15:24

frickskit


It could be any HTML tag; I can't say in advance. – Sriram Mar 28 '13 at 15:30
