I am trying to extract the text between the opening and closing tags of an XML element. The text is multilingual. For example:

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">
    तुम्हारा नाम क्या है
</string>

I googled it and found a few regexes, but none of them worked. Here is one I tried:

String str = "<string xmlns="+
    "http://schemas.microsoft.com/2003/10/Serialization/"+">"+
    "तुम्हारा नाम क्या है"+"</string>";

final Pattern pattern = Pattern.compile("<String xmlns="+
    "http://schemas.microsoft.com/2003/10/Serialization/"+">(.+?)</string>");

final Matcher matcher = pattern.matcher(str);
matcher.find();
System.out.println(matcher.group(1));

The given String format is

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">
    तुम्हारा नाम क्या है
</string>

and the expected output is:

तुम्हारा नाम क्या है

It's giving me an error

Draken
nand
  • For one, regex is case sensitive. Your pattern will only match `String [...]` with an uppercase "S" – Håken Lid Jun 07 '16 at 13:13
  • Please keep in mind: you can't parse XML or HTML with regular expressions. See http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la for the theory, and http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 for fun ... – GhostCat Jun 07 '16 at 13:17
  • To add to Jägermeister’s point: https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg – VGR Jun 07 '16 at 13:30

2 Answers

This pattern matches the whole element, and capture group 1 ($1) gives you the expected text:

/<string .*?>(.*?)<\/string>/

That said, I strongly recommend not doing this with a regex. Find an XML parser for Java and simply grab the content of the <string> tag.
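If a regex is used anyway, the pattern above might be applied in Java roughly as follows. This is only a sketch: the class name `RegexExtract` and the helper `extract` are made up for illustration. It assumes the input holds a single `<string>` element. `Pattern.DOTALL` is added so that `.` also matches the line breaks around the text, and the result of `find()` is checked before `group(1)` is called, because `group()` throws an `IllegalStateException` when there was no successful match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {
    // Same pattern as in the answer; "/" needs no escaping in a Java string.
    // DOTALL makes '.' match line terminators too, so multi-line content is captured.
    private static final Pattern STRING_TAG =
        Pattern.compile("<string .*?>(.*?)</string>", Pattern.DOTALL);

    // Hypothetical helper: returns the trimmed element text, or null if nothing matched.
    static String extract(String xml) {
        Matcher m = STRING_TAG.matcher(xml);
        // Only call group(1) after find() has returned true.
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String xml =
            "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
            + "    तुम्हारा नाम क्या है\n"
            + "</string>";
        System.out.println(extract(xml)); // prints: तुम्हारा नाम क्या है
    }
}
```

Note that the pattern is case sensitive, so `<String ...>` with an uppercase "S" would not match this input.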

Shafizadeh

Don’t use regular expressions for parsing XML. It will work in a few cases, but eventually it will fail. See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for a full explanation.

The easiest way to extract an element’s string content is with XPath:

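// Requires: java.io.StringReader, javax.xml.xpath.XPathFactory, org.xml.sax.InputSource
// evaluate() throws XPathExpressionException, which the caller must handle.
String contents =
    XPathFactory.newInstance().newXPath().evaluate(
        "//*[local-name()='string']",
        new InputSource(new StringReader(str)));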
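
As a fuller sketch of the XPath approach, the snippet can be wrapped in a small runnable class. The class name `XPathExtract` and the helper `extract` are hypothetical. The sketch also assumes that the whitespace around the text is unwanted, so it wraps the expression in the XPath function `normalize-space()`. That function trims the ends and collapses internal runs of whitespace to single spaces. Drop it if the whitespace matters.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathExtract {
    // Hypothetical helper: returns the whitespace-normalized text of the
    // first element whose local name is "string", in any namespace.
    static String extract(String xml) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(
                "normalize-space(//*[local-name()='string'])",
                new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            // Raised for malformed XML or an invalid expression.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
            + "    तुम्हारा नाम क्या है\n"
            + "</string>";
        System.out.println(extract(xml)); // prints: तुम्हारा नाम क्या है
    }
}
```

Unlike the regex in the question, this requires well-formed XML, so the `xmlns` attribute value must be quoted in the input string.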
VGR