Let's say I have a string containing XML with many occurrences of <tagA>:

String example = " (...) some xml here (...)
                    <tagA>283940</tagA>
                   (...) some xml here (...)
                    <tagA>& 9940</tagA>
                    <tagA>- 99440</tagA>
                    <tagA>< 99440</tagA>
                    <tagA>99440</tagA>
                   (...) more xml here (...) "

The content of each <tagA> should contain only digits, but sometimes it has a random character followed by a whitespace and then the digits. I want to remove the unwanted character and the whitespace. How can I do that?

So far I know I should be looking for a regex like "<tagA>. [0-9]*<\/tagA>", but I am stuck here.

I want to replace these characters because they include "&", ">" and "<", which make the XML invalid, so I cannot parse the string as XML first.

Simon

1 Answer

The regex that you're looking for is: <(\w+)>(\D{0,})(\d+)

In a match, group 1 captures the tag name, group 2 captures the unwanted characters (everything that is not a digit) and group 3 captures the number.
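To make the three groups concrete, here is a minimal sketch that runs the simple pattern through java.util.regex and prints what each group captured. The class name TagGroupsDemo and the helper describe are made up for illustration, and {0,} is written as the equivalent *.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagGroupsDemo {
    // Same pattern as above, with {0,} written as the equivalent *.
    static final Pattern TAG = Pattern.compile("<(\\w+)>(\\D*)(\\d+)");

    // Hypothetical helper: returns "tag|junk|digits" for the first match,
    // or null if the input contains no match.
    static String describe(String input) {
        Matcher m = TAG.matcher(input);
        if (!m.find()) {
            return null;
        }
        return m.group(1) + "|" + m.group(2) + "|" + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(describe("<tagA>& 9940</tagA>")); // tagA|& |9940
        System.out.println(describe("<tagA>99440</tagA>"));  // tagA||99440
    }
}
```

Group 2 is simply empty when the tag content is already clean, so the same pattern handles both cases.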

There's an "enhanced version" of this regex that might work in more situations: (\s{0,})(<\w+>)(\D{0,})(\d+)(\D{0,})(<\/\w+>)(\s{0,})

Group 1 captures any whitespace before the opening tag, and group 7 captures any trailing whitespace. Groups 2 and 6 match the opening and closing tags. Groups 3 and 5 match any non-digit characters on either side of the value. Group 4 contains the value itself.

With String::replaceAll, you can sanitize the input by keeping only groups 2, 4 and 6 and dropping the rest.

// Input data
String s = "<tagA>283940</tagA>\n"
        + "                    <tagA>& 9940<</tagA>\n"
        + "                    <tagA>- 99440</tagA>\n"
        + "                    <tagA>< 99440</tagA>\n"
        + "                    <tagA>99440</tagA>"
        + "<13243> asdfasdf </>";

// Keep only the opening tag ($2), the digits ($4) and the closing tag ($6).
String replaced = s.replaceAll("(\\s{0,})(<\\w+>)(\\D{0,})(\\d+)(\\D{0,})(<\\/\\w+>)(\\s{0,})", "$2$4$6");
System.out.println(replaced);

Output: <tagA>283940</tagA><tagA>9940</tagA><tagA>99440</tagA><tagA>99440</tagA><tagA>99440</tagA><13243> asdfasdf </>
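If only <tagA> needs cleaning and the surrounding whitespace should stay intact, a narrower variant may be enough. The sketch below is an assumption based on the question's description (at most one stray character followed by one space before the digits); it is not the regex from the answer above, and the class name TagASanitizer and method sanitize are hypothetical.

```java
class TagASanitizer {
    // Assumes the junk is exactly one character followed by one space,
    // as described in the question. Other tags and all whitespace
    // between tags are left untouched.
    static String sanitize(String xml) {
        return xml.replaceAll("<tagA>(?:. )?(\\d+)</tagA>", "<tagA>$1</tagA>");
    }

    public static void main(String[] args) {
        String input = "<tagA>& 9940</tagA>\n<tagA>< 99440</tagA>\n<tagA>99440</tagA>";
        System.out.println(sanitize(input));
    }
}
```

Because this pattern does not allow arbitrary non-digit runs, it cannot accidentally swallow a neighbouring tag, but it also will not fix trailing junk such as <tagA>& 9940<</tagA>; for that case the broader regex above is the better fit.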

Alex Roig