2

I'd like to censor some words in a string by replacing each character in the word with a "*". Basically I would want to do

String s = "lorem ipsum dolor sit";
s = s.replaceAll("ipsum|sit", $0.length() number of *));

so that the resulting s equals "lorem ***** dolor ***".

I know how to do this with repeated replaceAll invocations, but I'm wondering: is it possible to do this with a single replaceAll?


Update: It's a part of a research case-study and the reason is basically that I would like to get away with a one-liner as it simplifies the generated bytecode a bit. It's not for a serious webpage or anything.

aioobe

4 Answers

6

Here's a modification to aioobe's answer, using nested assertions instead of a nested loop to generate the assertions:

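import java.util.regex.Pattern;

public class Test {
    public static void main(String... args) {
        String s = "lorem ipsum dolor sit blah $10 bleh";
        System.out.println(s.replaceAll(censorWords("ipsum", "sit", "$10"), "*"));
        // prints "lorem ***** dolor *** blah *** bleh"
    }

    // Builds one alternative per word. Each alternative matches a single
    // character that lies inside an occurrence of that word.
    public static String censorWords(String... words) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            if (sb.length() > 0) sb.append("|");
            sb.append(
               // Look back at most length-1 characters to a position
               // from which the (quoted) word can be seen looking ahead.
               String.format("(?<=(?=%s).{0,%d}).",
                  Pattern.quote(w),
                  w.length()-1
               )
            );
        }
        return sb.toString();
    }
}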

Some key points:

  • StringBuilder.append in a loop instead of String +=
  • Pattern.quote so that regex metacharacters such as $ or \ in censored words are matched literally

That said, this is not the best solution to the problem. It's just a fun regex game to play, really.


How it works

We want to replace with "*", so we have to match one character at a time. The question is which character.

It's any character from which, if you go back far enough and then look forward, you see a censored word.

Here's the regex in more abstract form:

(?<=(?=something).{0,N})

This matches positions from which, going back up to N characters, you can look ahead and see something.
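
To make the abstract form concrete, here is a minimal sketch for a single word. The class name LookaroundDemo and the helper charInsideWord are made up for illustration. The pattern is the one from the answer above, with the trailing . that consumes exactly one character, and N is the word length minus one because the last character of the word sits that many positions after its start.

```java
import java.util.regex.Pattern;

class LookaroundDemo {
    // Hypothetical helper: a pattern matching one character that lies
    // inside an occurrence of `word`.
    static String charInsideWord(String word) {
        int n = word.length() - 1; // how far back the word's start may be
        return "(?<=(?=" + Pattern.quote(word) + ").{0," + n + "}).";
    }

    public static void main(String[] args) {
        String s = "lorem ipsum dolor sit";
        // Each of the five characters of "ipsum" matches on its own,
        // so each one is replaced by a single "*".
        System.out.println(s.replaceAll(charInsideWord("ipsum"), "*"));
        // prints "lorem ***** dolor sit"
    }
}
```

The space after "ipsum" is not replaced, because the start of the word is then five characters back, one more than the allowed maximum of four.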

polygenelubricants
4

It's possible using zero-width lookarounds:

public class Test {
    public static void main(String... args) {
        String s = "lorem ipsum dolor sit";
        System.out.println(s.replaceAll(censorWords("ipsum", "sit"), "*"));
    }

    public static String censorWords(String... words) {
        String re = "";
        for (String w : words)
            for (int i = 0; i < w.length(); i++)
                re += String.format("|((?<=%s)%s(?=%s))",
                        w.substring(0, i), w.charAt(i), w.substring(i + 1));
        return re.substring(1);
    }
}

Prints

lorem ***** dolor ***

The generated regular expression isn't pretty but it does the trick :-)
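
To see what "isn't pretty" means in practice, the following sketch prints the expression generated for a single short word. The class name GeneratedRegexDemo is made up for illustration; censorWords is copied from the answer above without changes.

```java
class GeneratedRegexDemo {
    // Same logic as in the answer: one alternative per character, which must be
    // preceded by the word's prefix and followed by the word's suffix.
    static String censorWords(String... words) {
        String re = "";
        for (String w : words)
            for (int i = 0; i < w.length(); i++)
                re += String.format("|((?<=%s)%s(?=%s))",
                        w.substring(0, i), w.charAt(i), w.substring(i + 1));
        return re.substring(1); // drop the leading "|"
    }

    public static void main(String[] args) {
        System.out.println(censorWords("sit"));
        // prints "((?<=)s(?=it))|((?<=s)i(?=t))|((?<=si)t(?=))"
    }
}
```

The expression grows with the total number of characters in the word list, and because the word fragments are inserted unquoted, this version assumes the words contain no regex metacharacters.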

aioobe
  • Hehe.. @polygenelubricants, *you* taught me look-arounds! :-) (I know the resulting expression is ugly as hell, and that it's probably an extremely inefficient solution :) – aioobe Jun 03 '10 at 14:42
  • see my answer; it's a slight mod to your answer. It was taught to me by Alan Moore. – polygenelubricants Jun 03 '10 at 14:47
  • You can change it to re += String.format("|(?i)((?<=%s)%s(?=%s))", if you want case-insensitive matching too – OWADVL Aug 22 '16 at 13:16
3

This is not a good way to censor text. Jeff Atwood has a great post about censoring in this way.

http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html

Unless you are going to spend lots and lots of time on this censoring feature it will probably end up censoring things that shouldn't be.

Another note:
Making the Java code a one-liner will not necessarily simplify the bytecode. By that logic, you could put your censoring code into a single method and then just call that.
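
One possible reading of that suggestion, as a sketch: the repeated replacements live in a helper method, and the call site stays a single line. The names CensorHelper and censor are made up for illustration, and String.repeat assumes Java 11 or later.

```java
class CensorHelper {
    // Hypothetical helper: replaces every occurrence of each word with
    // as many "*" as the word has characters.
    static String censor(String text, String... words) {
        for (String w : words) {
            // String.replace treats its arguments literally, so no regex quoting is needed.
            text = text.replace(w, "*".repeat(w.length()));
        }
        return text;
    }

    public static void main(String[] args) {
        String s = "lorem ipsum dolor sit";
        s = censor(s, "ipsum", "sit"); // the call site is a one-liner
        System.out.println(s);
        // prints "lorem ***** dolor ***"
    }
}
```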

jjnguy
  • It's a part of a case-study for a research project, and the reason is basically that I would like to have a simple one-liner as it simplifies the generated bytecode a bit. – aioobe Jun 03 '10 at 13:14
  • @aioobe Well, is it completely necessary to censor the document? Can you just do it by hand? – jjnguy Jun 03 '10 at 13:17
  • It's not a document, it's a part of a "dummy" chat server. As I said, nothing serious, just a part of a basic case-study in data-flow analysis. – aioobe Jun 03 '10 at 13:19
  • @aioobe I added a bit to my answer on the end you may wanna look at. – jjnguy Jun 03 '10 at 13:42
2

Java's String.replaceAll doesn't take a callback as an argument, so it isn't easy. But since profanity filters are mostly used on the web, I assume you can use JavaScript for that.

var s = "this is some sample text to play with";
var r = s.replace(/\b(some|sample|to)\b/g, function() {
  var star = "*";
  var len = arguments[1].length;
  while(--len)
    star += "*";
  return star;
});
console.log(r);//this is **** ****** text ** play with
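
For readers who need to stay in Java: since Java 9, Matcher.replaceAll also accepts a function that computes the replacement for each match, which allows much the same callback style. The sketch below assumes Java 11 or later (for String.repeat); the names CallbackCensor and censor are made up for illustration.

```java
import java.util.regex.Pattern;

class CallbackCensor {
    // Hypothetical helper: builds an alternation of the quoted words and
    // replaces each match with a run of "*" of the same length.
    static String censor(String text, String... words) {
        StringBuilder alternation = new StringBuilder();
        for (String w : words) {
            if (alternation.length() > 0) alternation.append('|');
            alternation.append(Pattern.quote(w));
        }
        return Pattern.compile(alternation.toString())
                .matcher(text)
                .replaceAll(m -> "*".repeat(m.group().length()));
    }

    public static void main(String[] args) {
        System.out.println(censor("lorem ipsum dolor sit", "ipsum", "sit"));
        // prints "lorem ***** dolor ***"
    }
}
```

Unlike the JavaScript version, this sketch does not add \b word boundaries, so it also censors the words where they occur inside longer words.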
Amarghosh
  • Nice solution. Basically something like that I'm after. Unfortunately it's not a web-application and javascript is not involved :-/ – aioobe Jun 03 '10 at 13:22
  • I figured it out myself, using java (see the answer I posted). But thanks for your suggestion! – aioobe Jun 03 '10 at 14:32