How do I extract a multilingual string from between XML tags?

Question

I am trying to extract the text between an XML element's opening and closing tags. The text is multilingual. For example:

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">
    तुम्हारा नाम क्या है
</string>

I have tried to Google it and found a few regexes, but they didn't work. Here is one I have tried:

String str = "<string xmlns="+
    "http://schemas.microsoft.com/2003/10/Serialization/"+">"+
    "तुम्हारा नाम क्या है"+"</string>";

final Pattern pattern = Pattern.compile("<String xmlns="+
    "http://schemas.microsoft.com/2003/10/Serialization/"+">(.+?)</string>");

final Matcher matcher = pattern.matcher(str);
matcher.find();
System.out.println(matcher.group(1));

The given String format is

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">
    तुम्हारा नाम क्या है
</string>

and the expected output is:

तुम्हारा नाम क्या है

It's giving me an error.

For one, regex is case sensitive. Your pattern will only match `String [...]` with an uppercase "S" — Håken Lid, Jun 07 '16 at 13:13
Please keep in mind: you can't parse XML or HTML with regular expressions. See http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la for the theory, and http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 for fun ... — GhostCat, Jun 07 '16 at 13:17
To add to Jägermeister’s point: https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg — VGR, Jun 07 '16 at 13:30

Shafizadeh · Accepted Answer · 2016-06-07T13:21:20.783

4

This pattern matches the whole element, and capture group 1 (`$1`) gives you the text between the tags. The `s` flag lets `.` match the line breaks around the text:

/<string .*?>(.*?)<\/string>/s
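In Java, the same pattern could be applied roughly as follows. This is only a sketch: the class name `RegexExtract` and the helper `extract` are made up for illustration, `Pattern.DOTALL` is used so that `.` also matches line breaks, and the tag name is lowercase so that it matches the input exactly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
class RegexExtract {
    // DOTALL lets '.' match the line breaks around the text.
    // No slash escaping is needed in a Java regex string.
    private static final Pattern STRING_TAG =
        Pattern.compile("<string .*?>(.*?)</string>", Pattern.DOTALL);

    // Returns the trimmed text between the tags, or null if the tag is not found.
    static String extract(String xml) {
        Matcher matcher = STRING_TAG.matcher(xml);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String xml = "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
            + "    तुम्हारा नाम क्या है\n"
            + "</string>";
        System.out.println(extract(xml));
    }
}
```

Checking the result of `find()` before calling `group(1)` avoids the `IllegalStateException` that is thrown when no match was found.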


That said, I strongly recommend not doing this with a regex. Use an XML parser in Java and simply read the content of the <string> element.
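One way to follow that advice is the DOM parser that ships with the JDK. The sketch below assumes the <string> element is the root of the document, as in the question; the class name `DomExtract` and the helper `extract` are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
class DomExtract {
    // Parses the XML and returns the trimmed text of the root element.
    static String extract(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reading from a StringReader avoids any byte-encoding guesswork.
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        // Assumption: <string> is the document root, as in the question.
        return doc.getDocumentElement().getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
            + "    तुम्हारा नाम क्या है\n"
            + "</string>";
        System.out.println(extract(xml));
    }
}
```

Because the parser works on characters rather than bytes here, the Devanagari text needs no special handling.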

edited Jun 07 '16 at 13:21

answered Jun 07 '16 at 13:13


score 0 · Answer 2 · answered Jun 07 '16 at 16:40

Don’t use regular expressions for parsing XML. It will work in a few cases, but eventually it will fail. See “Can you provide some examples of why it is hard to parse XML and HTML with a regex?” for a full explanation.

The easiest way to extract an element’s string content is with XPath:

String contents =
    XPathFactory.newInstance().newXPath().evaluate(
        "//*[local-name()='string']",
        new InputSource(new StringReader(str)));
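For completeness, here is a self-contained version of that XPath call with its imports. The wrapper class `XPathExtract` and the method `extract` are made up for illustration, and the `trim()` is an addition that strips the line breaks and indentation around the text. Matching on `local-name()` sidesteps the namespace declared by the `xmlns` attribute.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical wrapper class around the XPath call.
class XPathExtract {
    // Returns the trimmed text of the first element whose local name is "string".
    // If no such element exists, the XPath string value is empty.
    static String extract(String xml) throws XPathExpressionException {
        String contents = XPathFactory.newInstance().newXPath().evaluate(
            "//*[local-name()='string']",
            new InputSource(new StringReader(xml)));
        return contents.trim();
    }

    public static void main(String[] args) throws XPathExpressionException {
        String xml = "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
            + "    तुम्हारा नाम क्या है\n"
            + "</string>";
        System.out.println(extract(xml));
    }
}
```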
