I have an HTML page. I want to extract the text from only those tags whose text ends with a question mark. I am using:

<.+?>(.+?)<.+?>

to get the text inside tags, but there are two problems with this: 1. All the nested tags are extracted as well, which I don't want (I just want the plain text). 2. I only want the text of those tags whose text ends with a question mark.

I don't know how to do this. Can someone please help me (in Java)? PS: The HTML pages that I have are malformed, so using tools such as JSoup is not an option. That's why I am using regex only.

Hossein
  • Please read this answer: http://stackoverflow.com/a/1732454/396730 and use an HTML parser, not regexps. HTML parsers should usually be quite resistant to malformed documents, so try it out. – Philipp Wendler Aug 17 '12 at 10:10
  • Can you give an example of the malformedness? If you really can't correct the input, a foolproof non-regex approach would be to iterate over the input one character at a time: when you see a > start buffering plain text until you see a <, and so on... – Adam Aug 17 '12 at 10:14
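
The character-at-a-time scan described in the last comment could be sketched in Java roughly as follows. This is only an illustration under assumptions: the class name `TagTextScanner` and the method `questionTexts` are made up, text before the first tag is also treated as plain text, and a text run counts as a question if its trimmed form ends with `?`.

```java
import java.util.ArrayList;
import java.util.List;

public class TagTextScanner {

    // Collects every run of plain text between tags that ends with '?'.
    static List<String> questionTexts(String html) {
        List<String> result = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        boolean insideTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                // A tag starts, so the buffered plain text is complete.
                keepIfQuestion(buffer, result);
                insideTag = true;
            } else if (c == '>' && insideTag) {
                insideTag = false;
            } else if (!insideTag) {
                buffer.append(c);
            }
        }
        keepIfQuestion(buffer, result); // text after the last tag, if any
        return result;
    }

    // Stores the buffered text if it ends with '?', then clears the buffer.
    private static void keepIfQuestion(StringBuilder buffer, List<String> out) {
        String text = buffer.toString().trim();
        if (text.endsWith("?")) {
            out.add(text);
        }
        buffer.setLength(0);
    }

    public static void main(String[] args) {
        String html = "<p>Is it?</p><p>No.</p><div><i>x</i>Why?</div>";
        System.out.println(questionTexts(html)); // prints [Is it?, Why?]
    }
}
```

Because the scan never tries to pair opening and closing tags, unclosed or mismatched tags do not stop it, which may suit malformed pages.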

2 Answers

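Detecting nesting and excluding it is difficult, or impossible with unlimited nesting depth, using a regular expression, but you can try this:

<(.+?)>(.+?\?)</\1>

It matches only elements whose closing tag repeats the opening tag (`\1` is a backreference to the first group) and whose text ends with a question mark. The text itself is captured in the second group.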
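
In Java, the same idea might look roughly like the sketch below. The class name `QuestionTextExtractor` and the method `extract` are hypothetical, and the pattern is a stricter variant of the one above rather than a copy of it: it uses `[^<>]` instead of `.+?` so that a match cannot span nested tags, and it allows attributes after the tag name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuestionTextExtractor {

    // group 1 = tag name, group 2 = plain text ending in '?'.
    // \1 (written "\\1" in a Java string) refers back to the tag name.
    private static final Pattern QUESTION =
            Pattern.compile("<(\\w+)[^<>]*>([^<>]*\\?)\\s*</\\1\\s*>");

    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = QUESTION.matcher(html);
        while (m.find()) {
            result.add(m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>Is this a question?</p><p>No.</p><b>Really?</b>";
        System.out.println(extract(html)); // prints [Is this a question?, Really?]
    }
}
```

Elements that mix child tags with text, such as `<div><i>x</i>Why?</div>`, are skipped by this pattern, because the text must start directly after the opening tag.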

See on rubular

morja

Do you have a good reason to use regular expressions?

You can parse the HTML yourself; it may even be faster. Here is a small solution that works if there are no tags nested between <mytag?> and </mytag?>:

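    import java.util.LinkedList;

    final LinkedList<String> chunks = new LinkedList<String>();

    final String text = "<i>italic</i><mytag?>text</mytag?><href>anchor</href> <mySecondTag?>word</mySecondTag?>";

    String rest = text;
    int pos;
    // Find the end of each opening tag whose name ends with '?'.
    while ((pos = rest.indexOf("?>")) != -1)
    {
        // The text runs up to the start of the closing tag.
        final int endTag = rest.indexOf("<", pos);
        if (endTag == -1) break; // no closing tag: stop instead of throwing
        chunks.add(rest.substring(pos + 2, endTag));
        // Continue after the closing tag; stop if it is never terminated,
        // otherwise the loop would not advance.
        final int close = rest.indexOf(">", endTag + 1);
        if (close == -1) break;
        rest = rest.substring(close + 1);
    }

    System.out.println(chunks); // prints [text, word]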
Olivier Faucheux