In Java, what is the best way to split a string into an array of blocks, when the delimiters at the beginning of each block are different from the delimiters at the end of each block?

For example, suppose I have String string = "abc 1234 xyz abc 5678 xyz".

I want to apply some sort of complex split in order to obtain {"1234","5678"}.

The first thing that comes to mind is:

String[] parts = string.split("abc");
for (String part : parts)
{
    String[] blocks = part.split("xyz");
    String data = blocks[0];
    // Do some stuff with the 'data' string
}

Is there a simpler / cleaner / more efficient way of doing it?

My purpose (as you've probably guessed) is to parse an XML document.

I want to split a given XML string into the Inner-XML blocks of a given tag.

For example:

String xml = "<tag>ABC</tag>White Spaces Only<tag>XYZ</tag>";
String[] blocks = Split(xml,"<tag>","</tag>"); // should be {"ABC","XYZ"}

How would you implement String[] Split(String str,String prefix,String suffix)?

Thanks

Joshua Taylor
barak manos
  • You could use a regex, but if your expressions get any more complicated than your sample, you should probably use an XML parser. – assylias Jan 21 '14 at 18:45
  • You won't be able to handle `foo bar` correctly. – Joshua Taylor Jan 21 '14 at 18:46
  • @JoshuaTaylor is right. When working with non-trivial grammars such as XML, you should make use of software someone has already written. There's a learning curve for sure, but the time you'll spend learning it is much less than the time you'd spend chasing down "odd" cases such as comments, nested tags and escaped terminators. – Zymurgeek Jan 21 '14 at 19:02

4 Answers

You can write a regular expression for this type of string…

How about something like \s*((^abc)|(xyz\s*abc)|(\s*xyz$))\s* which matches abc at the beginning, or xyz at the end, or xyz abc in the middle (modulo some spaces)? Splitting on it produces an empty value at the beginning, but aside from that, it seems to do what you want.

import java.util.Arrays;

public class RegexDelimitersExample {
    public static void main(String[] args) {
        final String string = "abc 1234 xyz abc 5678 xyz";
        final String pattern = "\\s*((^abc)|(xyz\\s*abc)|(\\s*xyz$))\\s*";
        final String[] parts_ = string.split( pattern );
        // parts_[0] is "", because there's nothing before ^abc,
        // so a copy of the rest of the array is what we want.
        final String[] parts = Arrays.copyOfRange( parts_, 1, parts_.length );
        System.out.println( Arrays.deepToString( parts ));
    }
}
[1234, 5678]

Depending on how you want to handle spaces, you could adjust this as necessary. E.g.,

\s*((^abc)|(xyz\s*abc)|(\s*xyz$))\s*     # original
(^abc\s*)|(\s*xyz\s*abc\s*)|(\s*xyz$)    # no spaces on outside
...                                      # ...

…but you shouldn't use it for XML.

As I noted in the comments, though, this will work for splitting a non-nested string that has these sorts of delimiters. You won't be able to handle nested cases (e.g., abc abc 12345 xyz xyz) using regular expressions, so you won't be able to handle general XML (which seemed to be your intent). If you actually need to parse XML, use a tool designed for XML (e.g., a parser, an XPath query, etc.).
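For the non-nested case only, here is a minimal sketch of the Split(str, prefix, suffix) method from the question, written as a matcher loop rather than a split. The class name DelimitedBlocks and the lowercase method name are my own choices, not anything from the question. Pattern.quote keeps characters such as < and / in the delimiters from being read as regex syntax, and the reluctant (.*?) stops at the first suffix it finds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimitedBlocks {
    // Hypothetical helper: returns the text between each prefix and the
    // nearest following suffix. Assumes blocks are NOT nested.
    public static String[] split(String str, String prefix, String suffix) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(prefix) + "(.*?)" + Pattern.quote(suffix),
                Pattern.DOTALL); // DOTALL lets a block span several lines
        Matcher matcher = pattern.matcher(str);
        List<String> blocks = new ArrayList<>();
        while (matcher.find()) {
            blocks.add(matcher.group(1)); // group 1 is the text between the delimiters
        }
        return blocks.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String xml = "<tag>ABC</tag>   <tag>XYZ</tag>";
        // prints [ABC, XYZ]
        System.out.println(Arrays.toString(split(xml, "<tag>", "</tag>")));
    }
}
```

The blocks are returned untrimmed, so for "abc 1234 xyz abc 5678 xyz" you would still call trim() on each one. On nested input such as abc abc 12345 xyz xyz, the first block still contains the inner abc, which is exactly the limitation described above.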

Joshua Taylor

Don't use regexes here. You don't have to write full-fledged XML-parsing code either: use XPath and let the library do the parsing. The expression to search for in your example would be

//tag/text()

The code needed is:

import org.w3c.dom.NodeList;
import org.xml.sax.*;
import javax.xml.xpath.*;

public class Test {

    public static void main(String[] args) throws Exception {

        InputSource ins = new InputSource("c:/users/ndh/hellos.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList list = (NodeList)xpath.evaluate("//bar/text()", ins, XPathConstants.NODESET);
        for (int i = 0; i < list.getLength(); i++) {
            System.out.println(list.item(i).getNodeValue());
        }
        
    }
}

where my example xml file is

<?xml version="1.0"?>
<foo>
    <bar>hello</bar>
    <bar>ohayoo</bar>
    <bar>hola</bar>
</foo>
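
If the XML arrives as a String rather than a file, as in the question, the same XPath call can read from a StringReader. The sketch below rests on one assumption: the fragment in the question has no single root element, so it is wrapped in a made-up <root> element before evaluation. The class and method names are hypothetical too.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathFragmentSketch {
    // Hypothetical helper: collects the text nodes of every <tag> element
    // found in an XML fragment that may lack a single root element.
    public static List<String> innerText(String fragment, String tag) {
        // Assumption: wrapping in a made-up <root> makes the fragment well-formed.
        String wrapped = "<root>" + fragment + "</root>";
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//" + tag + "/text()",
                    new InputSource(new StringReader(wrapped)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (XPathExpressionException e) {
            // Raised for a bad expression or for input that cannot be parsed
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // prints [ABC, XYZ]
        System.out.println(innerText("<tag>ABC</tag>   <tag>XYZ</tag>", "tag"));
    }
}
```

Note that text() returns only text nodes, so this gives the inner text of each element, not arbitrary inner XML with child elements.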
Nathan Hughes

The best approach is to use one of the dedicated XML parsers. See this discussion about the best XML parser for Java.

I found this DOM XML parser example to be simple and good.
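
As a rough idea of what the DOM approach looks like (this is my own sketch using the standard javax.xml.parsers API, not the code from the linked example), you parse the string into a Document and ask for the elements by tag name. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DomTagTextSketch {
    // Hypothetical helper: returns the text content of every element
    // named 'tag' in a well-formed XML document given as a string.
    public static List<String> textOf(String xml, String tag) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList elements = doc.getElementsByTagName(tag);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < elements.getLength(); i++) {
                result.add(elements.item(i).getTextContent());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<foo><bar>hello</bar><bar>ohayoo</bar><bar>hola</bar></foo>";
        // prints [hello, ohayoo, hola]
        System.out.println(textOf(xml, "bar"));
    }
}
```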

gromi08

IMHO the best solution is to parse the XML, which is not a one-liner.

Look here

Here is sample code from another question on SO that parses the document and then navigates it with XPath:

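import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExample {
    public static void main(String[] args) throws Exception {
        String xml = "<resp><status>good</status><msg>hi</msg></resp>";

        // Parse the XML string into a DOM document
        InputSource source = new InputSource(new StringReader(xml));
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document document = db.parse(source);

        // Evaluate XPath expressions against the parsed document
        XPathFactory xpathFactory = XPathFactory.newInstance();
        XPath xpath = xpathFactory.newXPath();

        String msg = xpath.evaluate("/resp/msg", document);
        String status = xpath.evaluate("/resp/status", document);

        System.out.println("msg=" + msg + ";" + "status=" + status);
    }
}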

Complete thread of this post here

dbermudez