4

I'm looking for a regular expression that extracts the text between HTML tags of different types.

For example:

<span>Span 1</span> - Output: Span 1

<div onclick="callMe()">Span 2</div> - Output: Span 2

<a href="#">HyperText</a> - Output: HyperText

I found this pattern, <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>, from here, but it is not working.

Sriram
  • 1
    Please state exactly how it is not working. – MikeM Mar 28 '13 at 15:19
  • 3
    I would like to refer you to the legendary top answer of this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Philipp Mar 28 '13 at 15:20
  • @MikeM By "not working" I mean that it is not giving the desired result. It fails to extract the content; instead it shows the entire HTML tag. – Sriram Mar 28 '13 at 15:22
  • @Philipp I had gone through that but couldn't find the exact answer. – Sriram Mar 28 '13 at 15:26
  • @MikeM Yes. It is in the second group. Using java like this `test.replaceAll(patt, "$2")` – Sriram Mar 28 '13 at 15:28
  • 1
    @Sriram The exact answer is: **Don't use regular expressions to parse HTML**, in case that wasn't obvious enough. – Philipp Mar 28 '13 at 15:28
  • @MikeM I already provided it - `"test".replaceAll(patt, "$2")`. My intention with the above expression is to get **test** as the output, but it returns the entire text unchanged. – Sriram Mar 28 '13 at 15:39
  • @MikeM I guess I should be more clear - `String outPut = "test".replaceAll("<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)\1>", "$2"); System.out.println(outPut);` – Sriram Mar 28 '13 at 15:49
  • 1
    Use `"<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>"`. – MikeM Mar 28 '13 at 15:50
  • @MikeM You are brilliant. It worked like a charm. Thanks! Is it possible to check this recursively? I mean for nested tags like `sriram` – Sriram Mar 28 '13 at 16:05
  • Your best bet is to use a HTML parser. Something like http://jsoup.org/. – Mahesh Guruswamy Mar 28 '13 at 15:21

4 Answers

10

Your comment shows that you have neglected to escape the backslashes in your regex string.

If you also want to match lowercase letters, add a-z to the character classes, use Pattern.CASE_INSENSITIVE, or add (?i) to the beginning of the regex.

"<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>"

If the tag contents may contain newlines, then use Pattern.DOTALL or add (?s) to the beginning of the regex to turn on dotall/singleline mode.
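
As a rough illustration of the points above, here is a minimal sketch that applies the escaped pattern to the three examples from the question. The class name `TagTextDemo` and the helper `stripOuterTags` are made up for this example, and the two flags are used in place of spelling out a-z in the character classes. It only handles simple, well-formed, non-nested elements.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the original posts.
class TagTextDemo {
    // In a Java string literal, "\\b" is the regex \b and "\\1" is the
    // back-reference \1. With single backslashes, "\b" would become a
    // backspace character and "\1" would not compile.
    private static final Pattern TAG = Pattern.compile(
            "<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: replaces each matched element with its contents (group 2).
    static String stripOuterTags(String html) {
        return TAG.matcher(html).replaceAll("$2");
    }

    public static void main(String[] args) {
        System.out.println(stripOuterTags("<span>Span 1</span>"));                    // prints "Span 1"
        System.out.println(stripOuterTags("<div onclick=\"callMe()\">Span 2</div>")); // prints "Span 2"
        System.out.println(stripOuterTags("<a href=\"#\">HyperText</a>"));            // prints "HyperText"
    }
}
```

Pattern.DOTALL is what lets `(.*?)` run across line breaks inside an element; without it, an element whose contents span several lines is left untouched.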

MikeM
  • Thanks for this. Yep, I forgot to escape the backslashes in the expression. I'm looking for one more option in that expression which would recursively check the HTML tags and ultimately get the text between them. **Ex:** `test` I hope this time I'm very clear. – Sriram Mar 28 '13 at 16:23
  • @Sriram. To get the inner tags you would have to use the above regex in a loop, but I think you would be better off asking a new question for that. – MikeM Mar 28 '13 at 16:36
  • I am unable to retrieve the content between the tags below:

    Ajay has no watch

    So wait for a while to get time Please provide some solution
    – Kunal Varpe Mar 17 '17 at 06:22
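
The loop suggested in the comments above could look roughly like the following. This is only a sketch under assumptions: the class `NestedTagText`, the method `innerText`, and the sample input are invented for illustration, and the approach relies on well-formed markup. For real-world HTML a parser remains the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical class and method names, used only for this sketch.
class NestedTagText {
    private static final Pattern TAG = Pattern.compile(
            "<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Reapplies the replacement until the string stops changing, so each
    // pass peels off one layer of tags.
    static String innerText(String html) {
        String previous;
        String current = html;
        do {
            previous = current;
            current = TAG.matcher(previous).replaceAll("$2");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        // Made-up input with one level of nesting.
        System.out.println(innerText("<div><b>Hello</b> world</div>")); // prints "Hello world"
    }
}
```

Every successful replacement shortens the string, so the loop is guaranteed to terminate.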
1

This should suit your needs:

<([a-zA-Z]+).*?>(.*?)</\\1>

The first group contains the tag name, the second one the value in between.

sp00m
1
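import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Wrapper class added so the snippet compiles; the name is arbitrary.
public class TagTextFinder {
    public static void main(String[] args) {
        // Group 1 is the tag name, group 2 the text between the tags.
        // [^>]* and the lazy (.*?) keep each match within a single element.
        Matcher matcher = Pattern.compile("<([a-zA-Z]+)[^>]*>(.*?)</\\1>")
            .matcher("<a href=\"#\">HyperText</a>");

        while (matcher.find())
        {
            String matched = matcher.group(2);

            // start() and end() give the span of the whole match, tags included.
            System.out.println(matched + " found at\n"
                + "start at :- " + matcher.start() + "\n"
                + "end at :- " + matcher.end() + "\n");
        }
    }
}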
Ammy
-1

A very specific way:

(<span>|<a href="#">|<div onclick="callMe\(\)">)(.*)(</span>|</a>|</div>)

but yeah, this will only work for those 3 examples. You'll need to use an HTML parser.

frickskit