Censoring selected words (replacing them with ****) using a single replaceAll?

Question

I'd like to censor some words in a string by replacing each character in the word with a "*". Basically I would want to do

String s = "lorem ipsum dolor sit";
s = s.replaceAll("ipsum|sit", <$0.length() copies of "*">);  // pseudocode

so that the resulting s equals "lorem ***** dolor ***".

I know how to do this with repeated replaceAll invocations, but I'm wondering: is it possible to do this with a single replaceAll?

Update: It's a part of a research case-study and the reason is basically that I would like to get away with a one-liner as it simplifies the generated bytecode a bit. It's not for a serious webpage or anything.

I foresee a "Scunthorpe problem" in your future. http://en.wikipedia.org/wiki/Scunthorpe_problem — Paul Tomblin, Jun 03 '10 at 13:09
and avoid this: http://www.alanbaxteronline.com/2008/07/01/homosexual-wins-100m.html — Sam Holder, Jun 03 '10 at 13:12
See also: http://stackoverflow.com/questions/2628534/codingbat-plusout-using-regex — polygenelubricants, Jun 03 '10 at 14:41

score 6 · Accepted Answer · edited May 23 '17 at 11:43

Here's a modification to aioobe's answer, using nested assertions instead of a nested loop to generate the assertions:

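import java.util.regex.Pattern;

public class Censor {
    public static void main(String... args) {
        String s = "lorem ipsum dolor sit blah $10 bleh";
        System.out.println(s.replaceAll(censorWords("ipsum", "sit", "$10"), "*"));
        // prints "lorem ***** dolor *** blah *** bleh"
    }

    // Builds one alternative per word; each alternative matches a single
    // character that lies inside an occurrence of that word.
    public static String censorWords(String... words) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            if (sb.length() > 0) sb.append("|");
            sb.append(
                String.format("(?<=(?=%s).{0,%d}).",
                    Pattern.quote(w),   // treat the word as a literal
                    w.length() - 1      // how far back the word may start
                )
            );
        }
        return sb.toString();
    }
}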

Some key points:

StringBuilder.append in a loop instead of String +=
Pattern.quote to escape any $ or \ in censored words
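
To illustrate the second point, here is a small sketch comparing the quoted and unquoted forms of "$10". The class and method names (QuoteDemo, replaceUnquoted, replaceQuoted) are made up for this example and are not part of the answer's code.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Uses the word directly as a regex (illustrative helper).
    static String replaceUnquoted(String s, String word) {
        return s.replaceAll(word, "***");
    }

    // Wraps the word with Pattern.quote so it is matched literally.
    static String replaceQuoted(String s, String word) {
        return s.replaceAll(Pattern.quote(word), "***");
    }

    public static void main(String[] args) {
        String s = "blah $10 bleh";

        // Unquoted: "$" is the end-of-input anchor, so "$10" never matches.
        System.out.println(replaceUnquoted(s, "$10")); // prints "blah $10 bleh"

        // Quoted: the word becomes the literal \Q$10\E.
        System.out.println(replaceQuoted(s, "$10"));   // prints "blah *** bleh"
    }
}
```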

That said, this is not the best solution to the problem. It's just a fun regex game to play, really.

How it works

We want to replace with "*", so we have to match one character at a time. The question is which character.

It's the character where, if you go back far enough and then look forward, you see a censored word.

Here's the regex in more abstract form:

(?<=(?=something).{0,N})

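This matches positions where, if you are allowed to go back up to N characters, you can look ahead and see something. With N set to the word's length minus one (as in the code above), those are exactly the positions just before each character of the word.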
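
As a concrete instance, take something = sit and N = 2. The following sketch lists the index of every character that the full pattern (the lookbehind followed by a single ".") matches. The class and method names are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // Returns the indices of the characters matched by
    // (?<=(?=word).{0,N}).  where N is word.length() - 1.
    static List<Integer> matchedIndices(String text, String word) {
        String regex = String.format("(?<=(?=%s).{0,%d}).",
                Pattern.quote(word), word.length() - 1);
        Matcher m = Pattern.compile(regex).matcher(text);
        List<Integer> indices = new ArrayList<>();
        while (m.find()) {
            indices.add(m.start());
        }
        return indices;
    }

    public static void main(String[] args) {
        // "sit" occupies indices 6..8 of "dolor sit".
        System.out.println(matchedIndices("dolor sit", "sit")); // prints "[6, 7, 8]"
    }
}
```

Each of those three characters is matched separately, which is why replacing every match with a single "*" yields one star per letter.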

Beautiful! Would you care to break it apart and elaborate a bit? (The regexp I mean.) — aioobe, Jun 03 '10 at 14:51

aioobe · Answer 2 · 2010-06-03T14:35:23.073

4

It's possible using zero-width lookarounds:

public class Test {
    public static void main(String... args) {
        String s = "lorem ipsum dolor sit";
        System.out.println(s.replaceAll(censorWords("ipsum", "sit"), "*"));
    }

    public static String censorWords(String... words) {
        String re = "";
        for (String w : words)
            for (int i = 0; i < w.length(); i++)
                re += String.format("|((?<=%s)%s(?=%s))",
                        w.substring(0, i), w.charAt(i), w.substring(i + 1));
        return re.substring(1);
    }
}

Prints

lorem ***** dolor ***

The generated regular expression isn't pretty but it does the trick :-)
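
To see what "isn't pretty" means in practice, this sketch prints the expression built for the single word sit. The method body is copied from the answer above; only the class name GeneratedRegexDemo is invented for the example.

```java
public class GeneratedRegexDemo {
    // Same construction as censorWords in the answer above:
    // one alternative per character of each word.
    static String censorWords(String... words) {
        String re = "";
        for (String w : words)
            for (int i = 0; i < w.length(); i++)
                re += String.format("|((?<=%s)%s(?=%s))",
                        w.substring(0, i), w.charAt(i), w.substring(i + 1));
        return re.substring(1); // drop the leading "|"
    }

    public static void main(String[] args) {
        System.out.println(censorWords("sit"));
        // prints "((?<=)s(?=it))|((?<=s)i(?=t))|((?<=si)t(?=))"
    }
}
```

The expression grows by one alternative per character, and the words are inserted unescaped, so a word containing regex metacharacters would need quoting first.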

edited Jun 03 '10 at 14:35

answered Jun 03 '10 at 14:29

aioobe


Hehe.. @polygenelubricants, *you* taught me look-arounds! :-) (I know the resulting expression is ugly as hell, and that it's probably an extremely inefficient solution :) – aioobe Jun 03 '10 at 14:42
see my answer; it's a slight mod to your answer. It was taught to me by Alan Moore. – polygenelubricants Jun 03 '10 at 14:47
you can change it to re += String.format("|(?i)((?<=%s)%s(?=%s))", if you also want it to ignore case – OWADVL Aug 22 '16 at 13:16

jjnguy · Answer 3 · 2010-06-03T13:42:20.153

3

This is not a good way to censor text. Jeff Atwood has a great post about censoring in this way.

http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html

Unless you are going to spend lots and lots of time on this censoring feature, it will probably end up censoring things that shouldn't be censored.

Another note:
Making the Java code into a 1-liner will not necessarily simplify the bytecode. Using that logic, you could throw your censoring code into a single method and then just use that.
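
A minimal sketch of that idea, assuming a helper named censor in a class named CensorHelper (both names are hypothetical). It uses an explicit find loop with Matcher.appendReplacement rather than a single clever regex, so the call site stays a one-liner:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CensorHelper {
    // Hypothetical helper: replaces every occurrence of any listed word
    // with a run of '*' of the same length.
    static String censor(String text, String... words) {
        StringBuilder alternation = new StringBuilder();
        for (String w : words) {
            if (alternation.length() > 0) alternation.append('|');
            alternation.append(Pattern.quote(w)); // match each word literally
        }
        Matcher m = Pattern.compile(alternation.toString()).matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder stars = new StringBuilder();
            for (int i = 0; i < m.group().length(); i++) stars.append('*');
            // '*' has no special meaning in a replacement string, so no
            // Matcher.quoteReplacement call is needed here.
            m.appendReplacement(out, stars.toString());
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(censor("lorem ipsum dolor sit", "ipsum", "sit"));
        // prints "lorem ***** dolor ***"
    }
}
```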

edited Jun 03 '10 at 13:42

answered Jun 03 '10 at 13:09

jjnguy


It's a part of a case-study for a research project, and the reason is basically that I would like to have a simple one-liner as it simplifies the generated bytecode a bit. – aioobe Jun 03 '10 at 13:14
@aioobe Well, is it completely necessary to censor the document? Can you just do it by hand? – jjnguy Jun 03 '10 at 13:17
It's not a document, it's a part of a "dummy" chat server. As I said, nothing serious, just a part of a basic case-study in data-flow analysis. – aioobe Jun 03 '10 at 13:19
@aioobe I added a bit to my answer on the end you may wanna look at. – jjnguy Jun 03 '10 at 13:42

score 2 · Answer 4 · answered Jun 03 '10 at 13:17

2

Java's replace method doesn't take a callback as an argument, so it isn't easy. But since profanity filters are mostly used on the web, I assume you can use JavaScript for that.

var s = "this is some sample text to play with";
var r = s.replace(/\b(some|sample|to)\b/g, function() {
  var star = "*";
  var len = arguments[1].length;
  while(--len)
    star += "*";
  return star;
});
console.log(r);//this is **** ****** text ** play with
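
For readers on a newer JDK: since Java 9, Matcher.replaceAll also accepts a function of the match, so the same callback style can be sketched in Java. This postdates the thread, and the class and method names below are assumptions for illustration; String.repeat additionally requires Java 11.

```java
import java.util.regex.Pattern;

public class CallbackCensor {
    // Mirrors the JavaScript above: each matched word is replaced by
    // as many '*' characters as the word has letters.
    static String censor(String text) {
        return Pattern.compile("\\b(some|sample|to)\\b")
                .matcher(text)
                .replaceAll(mr -> "*".repeat(mr.group().length()));
    }

    public static void main(String[] args) {
        System.out.println(censor("this is some sample text to play with"));
        // prints "this is **** ****** text ** play with"
    }
}
```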

answered Jun 03 '10 at 13:17

Amarghosh


Nice solution. Basically something like that I'm after. Unfortunately it's not a web-application and javascript is not involved :-/ – aioobe Jun 03 '10 at 13:22
I figured it out myself, using java (see the answer I posted). But thanks for your suggestion! – aioobe Jun 03 '10 at 14:32
