1

I have this text:

   <message id="dsds" to="test@test.com" type="video" from="test@test"><body>TESTTESTTEST</body><active xmlns="http://jabber.org"/></message>

I want to get the content of the <body></body> element in this string.

I searched and found split, but it can't solve my problem. How can I get the text between <body> and </body> in Java?

wassgren
Sibel Tahta

6 Answers

4

Using a parser such as SAXParser or DocumentBuilder is much preferred. A parser identifies the tags reliably and lets you process their data, which is particularly handy when you have many tags to process.

Here is an example of using a SAX parser to read the body tag:

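        // Imports needed: javax.xml.parsers.SAXParser, javax.xml.parsers.SAXParserFactory,
        // org.xml.sax.Attributes, org.xml.sax.InputSource, org.xml.sax.SAXException,
        // org.xml.sax.helpers.DefaultHandler, java.io.StringReader
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();
        DefaultHandler handler = new DefaultHandler() {

            // characters() may be called several times for one element, so accumulate the text
            private final StringBuilder body = new StringBuilder();
            private boolean isBody = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
                if (qName.equalsIgnoreCase("body")) {
                    isBody = true;
                    body.setLength(0); // start collecting a fresh body
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                if (isBody) {
                    body.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (qName.equalsIgnoreCase("body")) {
                    isBody = false;
                    // the element is complete, so the collected text can be used now
                    System.out.println("body : " + body);
                }
            }
        };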

        saxParser.parse(new InputSource(new StringReader("<message id=\"dsds\" to=\"test@test.com\" type=\"video\" from=\"test@test\"><body id=\"dd\">TESTTESTTEST</body><active xmlns=\"http://jabber.org\"/></message>")), handler);
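
For comparison, the DocumentBuilder route mentioned above could look roughly like the following. This is a minimal sketch, not part of the original answer: the class name DomBodyExtractor and the helper extractBody are made up for illustration, and it assumes the input is well-formed XML with the body element named exactly "body".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomBodyExtractor {

    // Hypothetical helper: returns the text of the first <body> element,
    // or null if the document contains no such element.
    static String extractBody(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList bodies = doc.getElementsByTagName("body");
        return bodies.getLength() == 0 ? null : bodies.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<message id=\"dsds\" to=\"test@test.com\" type=\"video\" from=\"test@test\">"
                + "<body>TESTTESTTEST</body><active xmlns=\"http://jabber.org\"/></message>";
        System.out.println(extractBody(xml)); // prints TESTTESTTEST
    }
}
```

The DOM version loads the whole document into memory, which is fine for a short message like this one; the SAX handler is likely the better fit for large inputs.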
ZakiMak
2

Use a regex like this (it works for <body>asas asasa </body> as well as <body> </body>):

public static void main(String[] args) {
    String s = "<message id=\"dsds\" to=\"test@test.com\" type=\"video\" from=\"test@test\"><body>TESTTESTTEST</body><active xmlns=\"http://jabber.org\"/></message>";
    Pattern p = Pattern.compile("<body.*>(.*?)</body>");
    Matcher m = p.matcher(s);
    while (m.find()) {
        System.out.println(m.group(1));
    }
}

Output:

TESTTESTTEST
TheLostMind
  • have you seen the answers before answering? same as my answer but 9 minute later!:) – void Jan 07 '15 at 11:57
  • 1
    @FarhangAmary - Does your answer work for the inputs I have provided?. Inputs like `asas asasa `. Please check. Also, my *regex* is different. And if that helps, I saw your answer and *agreed* with Thilo. – TheLostMind Jan 07 '15 at 12:02
  • Well, there is something wrong in your regex .. it contains an odd amount of quotes. And as far as I see it, it would also fail if the body tag contains whitespaces (``) or attributes. – Tom Jan 07 '15 at 12:07
  • @Tom - Corrected it.. Was a typo. Thanks.. :).. Can you give me a sample input where this might fail?. – TheLostMind Jan 07 '15 at 12:09
  • @FarhangAmary - Please do whatever pleases you. – TheLostMind Jan 07 '15 at 12:10
  • @TheLostMind Thank you for the fix. It fails for example with `TESTTESTTEST` or `TESTTESTTEST`. The attribute in the second example is just a randomly selected one, but it shows, that the regex can't handle provided attributes. If OP says that there will be no attributes, then this is fine, but it would be good if you can find a solution for that. EDIT: the new regex handles that :). Good one. – Tom Jan 07 '15 at 12:13
  • @Tom - Not anymore. Check my *latest* answer :) – TheLostMind Jan 07 '15 at 12:14
  • @TheLostMind this won't work if we have a **space** inside last body tag for example won't work for : `TESTTESTTEST` test it – void Jan 07 '15 at 12:18
  • 2
    @TheLostMind Check my edit of the last comment :P. I already noticed that :). – Tom Jan 07 '15 at 12:28
1

Use the java.util.regex package:

    String htmlString = "<message id=\"dsds\" to=\"test@test.com\" type=\"video\" from=\"test@test\"><body>TESTTESTTEST</body><active xmlns=\"http://jabber.org\"/></message>";
    String bodyText="";
    Pattern p = Pattern.compile("<body.*>(.*?)</body.*>");
    Matcher m = p.matcher(htmlString);

    if (m.find()) {
        bodyText = m.group(1);
    }
    System.out.println(bodyText);

OUTPUT: TESTTESTTEST

void
1

In this specific case, I'd recommend using regular expressions with Matcher.

Possible solution: Java regex to extract text between tags

Community
jmartins
  • 2
    You should include the essential parts of your links in your answer. If the link becomes invalid your answer will then be meaningless and this should be avoided. – Tom Jan 07 '15 at 12:08
  • The link is to a possible duplicated question/solution. Should I include "essential parts" from another Stack Overflow answer in my answer? – jmartins Feb 24 '15 at 14:08
  • 1
    Either that or flag this question as a possible duplicate of your found question (last approach is better). – Tom Feb 24 '15 at 14:44
1

You can write the code like this:

        // Use the '\' character to escape " inside the string literal
        String s = "<message id=\"dsds\" to=\"test@test.com\" type=\"video\" from=\"test@test\"><body>TESTTESTTEST</body><active xmlns=\"http://jabber.org\"/></message>";
        int firstIndex = s.indexOf("<body>");
        int lastIndex = s.indexOf("</body>");
        // Skip past the opening tag itself (6 characters) before taking the substring
        System.out.println(s.substring(firstIndex + "<body>".length(), lastIndex));

And it will print the expected result.
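
Note that the snippet above assumes the opening tag is exactly <body> and that both tags are present; otherwise indexOf returns -1 and substring throws. A slightly more defensive variant might look like the sketch below. The class SubstringBodyExtractor and the method extractBody are hypothetical names, and the code still assumes no attribute value contains a '>' character.

```java
class SubstringBodyExtractor {

    // Hypothetical helper: returns the text between the first <body ...> tag
    // and the following </body>, or null if either tag is missing.
    static String extractBody(String s) {
        int open = s.indexOf("<body");
        if (open < 0) {
            return null; // no opening tag
        }
        int tagEnd = s.indexOf('>', open);
        if (tagEnd < 0) {
            return null; // opening tag is never closed
        }
        int contentStart = tagEnd + 1;
        int close = s.indexOf("</body>", contentStart);
        if (close < 0) {
            return null; // no closing tag
        }
        return s.substring(contentStart, close);
    }

    public static void main(String[] args) {
        String s = "<message id=\"dsds\"><body id=\"dd\">TESTTESTTEST</body></message>";
        System.out.println(extractBody(s)); // prints TESTTESTTEST
    }
}
```

This is still plain string searching, so it does not understand XML (entities, CDATA, nested elements); for anything beyond simple inputs a parser remains the safer choice.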

Bharat
0

Answers that solve this with a regex have already been given (though an XML parser might have been the better choice).

Here is a simple suggestion to modify the regex proposed in the solutions above:

Regex proposed: <body.*>(.*?)</body.*> => The .* inside the tags is greedy and matches any character.
Restricted regex: <body[^>]*>(.*?)</body[^>]*>

Replacing .* with [^>]* inside the tags improves the running time. The problem with the original regex is that .* first matches all the way to the end of the string and then backtracks, whereas [^>]* stops as soon as it sees the right angle bracket. I ran a simple test comparing the two regexes: the greedy one takes 3 times as long as the restricted one.
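
Besides speed, the two patterns can also return different results. The sketch below is an illustration added here, not the test mentioned above: the class BodyRegexDemo, the helper firstBody, and the two-body input string are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BodyRegexDemo {

    // Pattern from the earlier answers: .* inside the tags can run past the tag end
    static final Pattern DOT_STAR = Pattern.compile("<body.*>(.*?)</body.*>");

    // Pattern suggested here: [^>]* cannot cross the closing '>' of a tag
    static final Pattern NEGATED = Pattern.compile("<body[^>]*>(.*?)</body[^>]*>");

    // Hypothetical helper: returns group 1 of the first match, or null if there is none.
    static String firstBody(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed input with two body elements, to expose the difference
        String s = "<msg><body id=\"a\">first</body><body>second</body></msg>";
        System.out.println(firstBody(DOT_STAR, s)); // prints second
        System.out.println(firstBody(NEGATED, s));  // prints first
    }
}
```

With the original pattern, the .* after <body swallows the whole first element before backtracking, so the captured group ends up being the text of the last body element rather than the first.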