
I have a large String which contains some XML. This XML contains input like:

<xyz1>...</xyz1>
<hello>text between strings #1</hello>
<xyz2>...</xyz2>
<hello>text between strings #2</hello>
<xyz3>...</xyz3>

I want to get all of these <hello>text between strings</hello> elements.

In the end I want a List (or any Collection) which contains every <hello>...</hello>.

I tried it with Regex and Matcher, but the problem is that it doesn't work with large strings. With smaller Strings it works. I read a blog post about this which says that Java regex is broken for alternation over large strings.

Is there any easy and good way to do this?

Edit:

My attempt so far:

String pattern1 = "<hello>";
String pattern2 = "</hello>";
List<String> helloList = new ArrayList<String>();

String regexString = Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2);


Pattern pattern = Pattern.compile(regexString);

Matcher matcher = pattern.matcher(scannerString);
while (matcher.find()) {
  String textInBetween = matcher.group(1); // Since (.*?) is capturing group 1
  // You can insert match into a List/Collection here
  helloList.add(textInBetween);
  logger.info("-------------->>>> " + textInBetween);
}
DaUser

4 Answers

You should parse your XML with an XML parser. It is easier than using regular expressions.

The DOM parser is the simplest to use, but if your XML is very large, use the SAX parser, which streams the document instead of loading it all into memory.
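
A minimal sketch of the SAX approach, using only the JDK's built-in `javax.xml.parsers` classes. The class name `HelloSaxExtractor` and the `<root>` wrapper in `main` are assumptions made for illustration; the wrapper is there because a parser expects a single root element, which the sample in the question does not show.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
public class HelloSaxExtractor {

    /** Collects the text content of every <hello> element in the given XML. */
    public static List<String> extractHellos(String xml) throws Exception {
        final List<String> hellos = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while the parser is inside a <hello> element.
            private StringBuilder current;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("hello".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several chunks, so append.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("hello".equals(qName) && current != null) {
                    hellos.add(current.toString());
                    current = null;
                }
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return hellos;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the fragments from the question wrapped in a single root element.
        String xml = "<root><xyz1>...</xyz1><hello>text between strings #1</hello>"
                + "<xyz2>...</xyz2><hello>text between strings #2</hello><xyz3>...</xyz3></root>";
        System.out.println(extractHellos(xml));
    }
}
```

Because the handler only keeps the text of the element it is currently inside, memory use stays roughly proportional to the collected results rather than to the size of the whole document.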

Davide Lorenzo MARINO

I would highly recommend using one of the many publicly available XML parsers.

It is simply easier to achieve what you're trying to achieve that way (even if you wish to expand on your request in the future). If you have no issues with speed and memory, go ahead and use dom4j. There are a lot of resources online, and I can post good examples in this answer if you wish. For now my answer simply points you to alternative options, because I'm not sure what your limitations are.


Regarding REGEX when parsing XML, Dour High Arch gave a great response:

XML is not a regular language. You cannot parse it using a regular expression. An expression you think will work will break when you get nested tags, then when you fix that it will break on XML comments, then CDATA sections, then processor directives, then namespaces, ... It cannot work, use an XML parser.

Parsing XML with REGEX in Java
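
To make the quoted point concrete, here is a small sketch that runs a regex and a streaming parser (the JDK's StAX API, `javax.xml.stream`) over the same input. The class name `RegexVsParser` and the sample input are invented for illustration; the input contains a commented-out `<hello>` and a CDATA section that happens to contain the text `</hello>`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of any library.
public class RegexVsParser {

    /** Naive extraction: everything between "<hello>" and the next "</hello>". */
    static List<String> withRegex(String xml) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("<hello>(.*?)</hello>", Pattern.DOTALL).matcher(xml);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    /** Parser-based extraction: text content of each real <hello> element. */
    static List<String> withStax(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "hello".equals(reader.getLocalName())) {
                // Reads the element's text (including CDATA) up to its end tag.
                out.add(reader.getElementText());
            }
        }
        reader.close();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: one commented-out element, one element using CDATA.
        String xml = "<root><!-- <hello>old</hello> -->"
                + "<hello><![CDATA[a </hello> b]]></hello></root>";
        System.out.println("regex:  " + withRegex(xml));
        System.out.println("parser: " + withStax(xml));
    }
}
```

On this input the regex picks up the commented-out element and cuts the CDATA content short, while the parser reports only the one real element with its full text.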

Juxhin

If you have to parse an XML file, I suggest using the XPath language. Basically, you have to do the following:

  1. Parse the XML String into a DOM object
  2. Create an XPath query
  3. Query the DOM

Have a look at this link.

An example of what you have to do:

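import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

String xml = ...;
try {
   // Build structures to parse the String
   DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
   // Parse the XML string into a DOM object (a Reader avoids platform-charset issues with getBytes())
   Document document = builder.parse(new InputSource(new StringReader(xml)));
   // Create an XPath query
   XPath xPath = XPathFactory.newInstance().newXPath();
   // Query the DOM object with the query '//hello'
   NodeList nodeList = (NodeList) xPath.compile("//hello").evaluate(document, XPathConstants.NODESET);
} catch (Exception e) {
   e.printStackTrace();
}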
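
The three steps above can also be packaged as a method that returns the `List<String>` the question asks for. This is only a sketch under stated assumptions: the class name `HelloXPathExtractor` and the synthetic `<wrapper>` root are invented here. The wrapper is added because the sample in the question shows several top-level elements, which a DOM parser would reject, and it assumes the input String does not start with an XML declaration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
public class HelloXPathExtractor {

    /** Returns the text content of every <hello> element in the fragment. */
    public static List<String> extractHellos(String xmlFragment) throws Exception {
        // Assumption: the fragment has no XML declaration, so it can be wrapped
        // in a synthetic root element to make it a well-formed document.
        String wrapped = "<wrapper>" + xmlFragment + "</wrapper>";

        // 1. Parse the XML String into a DOM object
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(wrapped)));

        // 2. Create an XPath query
        XPath xPath = XPathFactory.newInstance().newXPath();

        // 3. Query the DOM and copy the matches into a List
        NodeList nodes = (NodeList) xPath.evaluate("//hello", document, XPathConstants.NODESET);
        List<String> hellos = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            hellos.add(nodes.item(i).getTextContent());
        }
        return hellos;
    }

    public static void main(String[] args) throws Exception {
        String fragment = "<xyz1>...</xyz1><hello>text between strings #1</hello>"
                + "<xyz2>...</xyz2><hello>text between strings #2</hello><xyz3>...</xyz3>";
        System.out.println(extractHellos(fragment));
    }
}
```

Note that DOM loads the whole document into memory, so for very large inputs a streaming parser may be the better fit.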
riccardo.cardin

With Java 8 you could use the Dynamics library to do this in a straightforward way:

XmlDynamic xml = new XmlDynamic(
    "<bunch_of_data>" +
        "<xyz1>...</xyz1>" +
        "<hello>text between strings #1</hello>" +
        "<xyz2>...</xyz2>" +
        "<hello>text between strings #2</hello>" +
        "<xyz3>...</xyz3>" +
    "</bunch_of_data>");

List<String> hellos = xml.get("bunch_of_data").children()
    .filter(XmlDynamic.hasElementName("hello"))
    .map(hello -> hello.asString())
    .collect(Collectors.toList()); // ["text between strings #1", "text between strings #2"]

See https://github.com/alexheretic/dynamics#xml-dynamics

Alex Butler