2

I have been banging my head against this regular expression all day.

The task looks simple: I have a number of XML tag names and I must replace (mask) their content.

For example:

<Exony_Credit_Card_ID>242394798</Exony_Credit_Card_ID>

must become:

<Exony_Credit_Card_ID>filtered</Exony_Credit_Card_ID>

There are multiple such tags with different names.

How do I match the text inside a tag without matching the tag itself?

EDIT: I should clarify again. Capturing the tags in groups and reusing those groups in the replacement (so that the tags themselves are preserved) does not work in my case, because when I add the other tags to the expression, the group numbers are different for the subsequent alternatives. For example:

"(<Exony_Credit_Card_ID>).+(</Exony_Credit_Card_ID>)|(<Billing_Postcode>).+(</Billing_Postcode>)"

replaceAll with the string "$1filtered$2" does not work, because when the regex matches Billing_Postcode its groups are 3 and 4 instead of 1 and 2.
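To make the numbering problem concrete, here is a minimal sketch. The class name GroupNumberingDemo and the sample postcode are made up for illustration. When the second alternative matches, groups 1 and 2 did not take part in the match, so `$1` and `$2` expand to nothing and the tags are dropped.

```java
class GroupNumberingDemo {
    // The two-alternative pattern from the question: groups 1-2 belong to the
    // first alternative, groups 3-4 to the second.
    static final String REGEX =
        "(<Exony_Credit_Card_ID>).+(</Exony_Credit_Card_ID>)"
        + "|(<Billing_Postcode>).+(</Billing_Postcode>)";

    static String mask(String xml) {
        return xml.replaceAll(REGEX, "$1filtered$2");
    }

    public static void main(String[] args) {
        // First alternative: groups 1 and 2 are set, so the tags survive.
        System.out.println(mask("<Exony_Credit_Card_ID>242394798</Exony_Credit_Card_ID>"));
        // Second alternative: groups 1 and 2 are unset, so only "filtered" remains.
        System.out.println(mask("<Billing_Postcode>AB1 2CD</Billing_Postcode>"));
    }
}
```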

Boris Hamanov

5 Answers

6
String resultString = subjectString.replaceAll(
    "(?x)    # (multiline regex): Match...\n" +
    "<(Exony_Credit_Card_ID|Billing_Postcode)> # one of these opening tags\n" +
    "[^<>]*  # Match whatever is contained within\n" +
    "</\\1>  # Match corresponding closing tag",
    "<$1>filtered</$1>");
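As a rough, self-contained way to try this approach, the sketch below applies the same idea without the free-spacing comments to a small document. The class name BackreferenceMasker and the surrounding Order and Qty elements are hypothetical and only serve the demonstration.

```java
class BackreferenceMasker {
    // One capturing group holds the tag name and \1 reuses it for the closing
    // tag, so the group number stays 1 however many names are listed.
    static final String REGEX =
        "<(Exony_Credit_Card_ID|Billing_Postcode)>[^<>]*</\\1>";

    static String mask(String xml) {
        return xml.replaceAll(REGEX, "<$1>filtered</$1>");
    }

    public static void main(String[] args) {
        String xml = "<Order><Exony_Credit_Card_ID>242394798</Exony_Credit_Card_ID>"
                   + "<Billing_Postcode>AB1 2CD</Billing_Postcode><Qty>3</Qty></Order>";
        // Both listed elements are masked; Order and Qty are left alone.
        System.out.println(mask(xml));
    }
}
```

Adding another tag only means adding another name to the alternation; the replacement string does not change.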
Tim Pietzcker
  • @Tim, don't all your inline-comments need to end with a line break? (can't test it myself right now...) – Bart Kiers Feb 11 '11 at 11:05
  • @Bart: Probably yes; I wrapped the first line manually and forgot to add the `\n`. By the way, is it legal to have a line break between two parameters of a method call (I added one between the search and replace terms)? – Tim Pietzcker Feb 11 '11 at 11:07
  • @Tim, yeah, that's perfectly legal: you can add as many of them as you like. – Bart Kiers Feb 11 '11 at 11:10
  • The problem is that I have multiple tags, as I said, with different names. When I added them all to one expression, <$1>filtered</$1> would only work for the first group. I already tried it. – Boris Hamanov Feb 11 '11 at 11:19
  • 2
    You need to add the tags to the second line of the regex and separate them with `|` as I showed there. – Tim Pietzcker Feb 11 '11 at 11:27
1

In your situation, I'd use this:

(?<=<(Exony_Credit_Card_ID|tag1|tag2)>)(\\d+)(?=</(Exony_Credit_Card_ID|tag1|tag2)>)

Then replace the matches with filtered; the tags are excluded from the returned match. As your goal is to hide sensitive data, it's better to be safe and use "aggressive" matching that masks as much potentially sensitive data as possible, even if some of it turns out not to be sensitive.

You may need to adjust the tag content matcher ( \\d+ ) if the data contains other characters, such as whitespace, slashes, or dashes.
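A minimal sketch of the lookaround idea, assuming the content matcher is widened from \\d+ to [^<>]+ so that non-numeric values are masked too. The class name LookaroundMasker, the TAGS constant, and the sample values are made up for illustration. Note that this variant does not check that the closing tag has the same name as the opening one.

```java
import java.util.regex.Pattern;

class LookaroundMasker {
    // Hypothetical list of sensitive tag names, joined as a regex alternation.
    static final String TAGS = "Exony_Credit_Card_ID|Billing_Postcode";

    // The lookbehind and lookahead assert the tags without consuming them,
    // so the match is only the text between the tags.
    static final Pattern CONTENT = Pattern.compile(
        "(?<=<(?:" + TAGS + ")>)[^<>]+(?=</(?:" + TAGS + ")>)");

    static String mask(String xml) {
        // No group references are needed: the whole match is the content.
        return CONTENT.matcher(xml).replaceAll("filtered");
    }

    public static void main(String[] args) {
        System.out.println(mask(
            "<Exony_Credit_Card_ID>242394798</Exony_Credit_Card_ID>"
            + "<Billing_Postcode>AB1 2CD</Billing_Postcode>"));
    }
}
```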

mdrg
  • OP writes that there are multiple such tags with different names. – Tim Pietzcker Feb 11 '11 at 11:08
  • @Tim Pietzcker OK, I missed that. – mdrg Feb 11 '11 at 11:14
  • Thanks mdrg, your solution looks promising, as it does not rely on group numbers (or so I think). Is everything inside the non-capturing groups not matched? – Boris Hamanov Feb 11 '11 at 11:30
  • 1
    @avok00 Yes, it doesn't need group numbers. You may replace the entire match with your obfuscating string `filtered`, as the tags are not part of the match. This is done with zero-width lookahead and lookbehind as above. – mdrg Feb 11 '11 at 13:15
0

I have not debugged this code but you should use something like this:

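// Capture the text between an opening tag and a closing tag.
// The closing tag needs the "/" that the original pattern was missing.
Pattern p = Pattern.compile("<\\w+>([^<]*)</\\w+>");
Matcher m = p.matcher(str);
if (m.find()) {
    String tagContent = m.group(1); // text of the first matching element
}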

I hope it is a good start.

Bart Kiers
AlexR
0

I would use something like this:

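// Requires java.util.Set, java.util.regex.Matcher and java.util.regex.Pattern.
// [^<]* restricts matches to leaf elements, so nested elements are still visited.
private static final Pattern PAT = Pattern.compile("<(\\w+)>([^<]*)</\\1>");

private static String replace(String s, Set<String> toReplace) {
    Matcher m = PAT.matcher(s);
    StringBuffer sb = new StringBuffer();
    // find() scans the whole input; matches() would only succeed if s were exactly one element.
    while (m.find()) {
        String tag = m.group(1);
        String replacement = toReplace.contains(tag)
                ? "<" + tag + ">filtered</" + tag + ">"
                : m.group(); // leave other elements untouched
        m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(sb);
    return sb.toString();
}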
proactif
0

I know you said that relying on group numbers does not work in your case, but I can't really see why. Could you not use something like this?

xmlString.replaceAll("<(Exony_Credit_Card_ID|tag2|tag3)>([^<]+)</(\\1)>", "<$1>filtered</$1>");

This works on the basic samples I used as a test.

Edit: to break it down:

"<(Exony_Credit_Card_ID|tag2|tag3)>" + // matches the opening tag and captures its name
"([^<]+)" + // then anything in between the opening and closing of the tag
"</(\\1)>" // and finally the end tag corresponding to what we matched as the first group (Exony_Credit_Card_ID, tag2 or tag3)

"<$1>" + // Replace using the first captured group (tag name)
"filtered" + // the "filtered" text
"</$1>" // and the closing tag corresponding to the first captured group
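Since the list of tag names can grow, one option is to build the alternation from a collection instead of typing it by hand. The following is only a sketch under that assumption; the class name ConfigurableMasker and the sample tag list are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ConfigurableMasker {
    private final Pattern pattern;

    ConfigurableMasker(List<String> tagNames) {
        // Quote each name so regex metacharacters in a tag name are taken literally.
        String alternation = tagNames.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        // Group 1 is always the tag name, whichever alternative matched.
        pattern = Pattern.compile("<(" + alternation + ")>[^<]+</\\1>");
    }

    String mask(String xml) {
        return pattern.matcher(xml).replaceAll("<$1>filtered</$1>");
    }

    public static void main(String[] args) {
        ConfigurableMasker masker = new ConfigurableMasker(
            Arrays.asList("Exony_Credit_Card_ID", "Billing_Postcode"));
        System.out.println(masker.mask(
            "<Billing_Postcode>AB1 2CD</Billing_Postcode><Qty>3</Qty>"));
    }
}
```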
Kellindil