12

(I come from the python world, so I apologise if some of the terminology I use jars with the norm.)

I have a String with a List of start/end indices to replace. Without getting too much into detail, consider this basic mockup:

String text = "my email is foo@bar.com and my number is (213)-XXX-XXXX"
List<Token> findings = SomeModule.someFnc(text);

And Token has the definition of

class Token {
    int start, end;
    String type;
}

This List represents start and end positions of sensitive data that I'm trying to redact.

Effectively, the API returns data that I iterate over to get:

[{ "start" : 12, "end" : 22, "type" : "EMAIL_ADDRESS" }, { "start" : 41, "end" : 54, "type" : "PHONE_NUMBER" }]

Using this data, my end goal is to redact the tokens in text specified by these Token objects to get this:

"my email is [EMAIL_ADDRESS] and my number is [PHONE_NUMBER]"

The thing that makes this question non-trivial is that the replacement substrings aren't always the same length as the substrings they're replacing.

My current plan of action is to build a StringBuilder from text, sort these IDs in reverse order of start indices, and then replace from the right end of the buffer.

But something tells me there should be a better way... is there?

cs95
  • 379,657
  • 97
  • 704
  • 746
  • Wait...you're _starting off_ with a string containing an email address, and you want to _replace_ that address by a token? Is that right? – Tim Biegeleisen Jun 19 '18 at 05:27
  • I would maybe go with the approach of token-ising all strings and then provide have a class that stores the original string and its replacement - from which it is easy to rebuild into the original of the redacted version – Scary Wombat Jun 19 '18 at 05:31
  • @TimBiegeleisen yes, I'm implementing a PII redactor. – cs95 Jun 19 '18 at 05:32
  • Is it a requirement to use `SomeModule.someFnc(text);` ? – Scary Wombat Jun 19 '18 at 05:39
  • @ScaryWombat Yep, that's an upstream dependency I call to get a list of tokens that need to be replaced. – cs95 Jun 19 '18 at 05:43
  • @rustyx That looks interesting... could you please convert that into an answer? – cs95 Jun 19 '18 at 06:12
  • So you get a list of places you need to replace as input, no need to search for anything, correct? Is the list sorted by position? – rustyx Jun 19 '18 at 06:16
  • 3
    @rustyx Yes, exactly. The `List` list is sorted in ascending order of start index. – cs95 Jun 19 '18 at 06:17
  • I think fundamentally, you're trying to build a lexical analyzer and parser to generate an abstract syntax tree. As a result, you would most likely be best served by just using the standard, formal approaches or existing technologies to do so. (E.g., define a formal grammar, work out the state machine, implement it, or use some lib that generates the state machine implementation from grammar for you.) Once you have the syntax tree, you can redact the elements you want to and convert it back to a string. That makes me think this is Too Broad. – jpmc26 Jun 19 '18 at 08:06
  • 2
    @jpmc26 I'm not trying to build anything. There's an API that sniffs out sensitive information and returns possible matches. I'm reading those match objects and redacting strings manually. Nothing broad about string replacement? – cs95 Jun 19 '18 at 08:09

3 Answers3

10

This approach works:

import java.util.ArrayList;
import java.util.List;

public class Test {
    public static void main(String[] args) {
        String text = "my email is foo@bar.com and my number is (213)-XXX-XXXX";

        List<Token> findings = new ArrayList<>();
        findings.add(new Token(12, 22, "EMAIL_ADDRESS"));
        findings.add(new Token(41, 54, "PHONE_NUMBER"));

        System.out.println(replace(text, findings));
    }

    public static String replace(String text, List<Token> findings) {
        int position = 0;
        StringBuilder result = new StringBuilder();

        for (Token finding : findings) {
            result.append(text.substring(position, finding.start));
            result.append('[').append(finding.type).append(']');

            position = finding.end + 1;
        }

        return result.append(text.substring(position)).toString();
    }
}

class Token {
    int start, end;
    String type;

    Token(int start, int end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }
}

Output:

my email is [EMAIL_ADDRESS] and my number is [PHONE_NUMBER]
Robby Cornelissen
  • 91,784
  • 22
  • 134
  • 156
  • I see. So instead of replacing from the end, you append from the start. That's quite similar to the iterative replacement I was thinking of to begin with. Is there nothing better? (I can take "no" for an answer.) – cs95 Jun 19 '18 at 05:35
  • 2
    Oh there probably is a more elegant solution out there. I'll give it some more thought. – Robby Cornelissen Jun 19 '18 at 05:36
  • 1
    Alright, thanks. But I appreciate your answer nonetheless, so +1. – cs95 Jun 19 '18 at 05:37
  • @coldspeed are your strings small or large? Do you have many such strings you need to perform replacements for? Depending on these factors there is probably a way to parallelize – Coder-Man Jun 19 '18 at 06:03
  • @POrekhov Well, they will mostly be the size of a small email, and will not have more than 2-3 replacements. I'm running this code inside a Map Reduce cluster so as long as the code is efficient, that's fine with me :-) – cs95 Jun 19 '18 at 06:07
  • @coldspeed I am not familiar with Map Reduce clusters. But does the code you write have to be parallelized in java, for example you have a large list of strings to replace and you want to do those replacement jobs in parallel, like one thread does replacements for the first 100, another for the second 100, etc, or will the map reduce cluster somehow parallelize this? – Coder-Man Jun 19 '18 at 06:13
  • @POrekhov I just need to make sure replacements are efficient for one string and then the framework handles the parallelization for the remaining 100,000 of them. – cs95 Jun 19 '18 at 06:15
  • @coldspeed anyways, this version will be faster on large inputs, because it uses StringBuilder, also my version had a mistake, I did text.split(splitter), instead of text.split(Pattern.quote(splitter)), split takes a regex string as input parameter, and if there's something strange written in the regex, there will be an error (I just tested my code and it gave me invalid character range error, however, using Pattern.qoute fixed it) – Coder-Man Jun 19 '18 at 06:53
  • 2
    I would preallocate the buffer like `new StringBuilder(text.length() + 32)`. Otherwise this is the fastest possible solution (in Java the goal is to minimize object creation). – rustyx Jun 19 '18 at 07:34
  • @rustyx Good point. Or even `new StringBuilder(text.length() + findings.size() * 10)` or something to that effect. – Robby Cornelissen Jun 19 '18 at 07:39
  • You need to make sure your redactions are in a proper order... reverse them and see... Otherwise this is what I had in mind also – Heiko Hatzfeld Jun 19 '18 at 09:37
  • @HeikoHatzfeld The order is guaranteed. OP stated in the comments to the question: *"The `List` list is sorted in ascending order of start index."* – Robby Cornelissen Jun 19 '18 at 09:39
3

Ensure that all tokens are sorted by start index in ascending order:

List<Token> tokens = new ArrayList<>();
tokens.sort(Comparator.comparing(Token::getStart));

Now you can replace all strings starting from the end of the input text:

public String replace(String text, List<Token> tokens) {
    StringBuilder sb = new StringBuilder(text);
    for (int i = tokens.size() - 1; i >= 0; i--) {
        Token token = tokens.get(i);
        sb.replace(token.start, token.end + 1, "[" + token.type + "]");
    }
    return sb.toString();
}
Oleksandr Pyrohov
  • 14,685
  • 6
  • 61
  • 90
1

Extract the substring between start and end, and split by it. Then you get an array of 2 elements, insert what you want in between. Next you have to move your next strings' to replace ids by the difference between the (previous string's that you replaced length) and (the string that you put in its place).

Code(in by case the 'end' in Token is exclusive):

public class Main {

    public static void main(String... args) {
        String text = "I want to replace AAA and B and scary wombat";
        Token[] tokens = {new Token(18, 21, "TEST"), new Token(26, 27, "TEST"), new Token(32, 44, "TEST")};
        int delta = 0;
        for (Token token : tokens) {
            String splitter = text.substring(token.start + delta, token.end + delta);
            System.out.println("Splitter: " + splitter);
            delta += token.replacement.length() - splitter.length();
            String[] beforeAndAfter = text.split(Pattern.quote(splitter));
            text = beforeAndAfter[0] + token.replacement + 
                    (beforeAndAfter.length == 2 ? beforeAndAfter[1] : ""); // in case where there are no more chars after splitter in text
        }
        System.out.println(text);
    }

    static class Token {
        public final int start, end;
        public final String replacement;

        public Token(int start, int end, String replacement) {
            this.start = start;
            this.end = end;
            this.replacement = replacement;
        }
    }
}
Coder-Man
  • 2,391
  • 3
  • 11
  • 19
  • But it isn't just one string I'm trying to replace. As I explained, I have a list of objects that specify start, end, and replacement substrings. Will this still work? – cs95 Jun 19 '18 at 05:34
  • Well imagine you had string "AA" and you put "BBB" in it's place, now you have to move all of your next strings' id's by one. You do not have to update your next strings' ids, you can simply store that delta in a seperate variable. – Coder-Man Jun 19 '18 at 05:35
  • I'm not the downvoter, but I'm not 100% sure this will work for my use case? – cs95 Jun 19 '18 at 05:38
  • *now you have to move all of your next strings' id's by one* please show what you mean – Scary Wombat Jun 19 '18 at 05:39
  • @coldspeed yes, i am, I thought it was obvious – Coder-Man Jun 19 '18 at 05:48
  • Cool. Thanks for your answer. I'll take a look at these answers and see how they compare to what I've already come up with. +1 – cs95 Jun 19 '18 at 05:49
  • 3
    I prefer @Robby version using a new output rather than having to worry about `deltas` – Scary Wombat Jun 19 '18 at 05:49
  • @ScaryWombat or I would iterate from the end to start (since ordered by ascending order) to don't care about the index delta either – AxelH Jun 19 '18 at 06:24