ignore repeating characters

Question

I am trying to make a swearing prevention system. So far I have ignored whitespace (with "\s*") and case (with "(?i)"). How would I ignore repeated characters, e.g. heeeello?

I assume you want to end up matching your potential swear word(s) against a "dictionary of swear words", right? And you want to prevent "minor changes" in the swear word from getting past the filter. — Floris, Dec 31 '13 at 21:24
Be sure not to fall into the common trap of censoring assassin or what-have you because of the "ass" in it. — Adam Smith, Dec 31 '13 at 21:25
I'm not using a dictionary, no, as the system I'm limited to doesn't allow that. As for assassin, it unfortunately would turn out as a**assin. I most likely won't be blocking ass, however. — Brodie, Dec 31 '13 at 21:29
you can easily account for the word assassin by checking for spaces before and after — A.O., Dec 31 '13 at 21:31
You might want to check this [question](http://stackoverflow.com/questions/273516/how-do-you-implement-a-good-profanity-filter) out — HamZa, Dec 31 '13 at 21:31
@Brodie Because there are so many differences between different regex engines, it's very helpful if you specify what language / platform you're using when you ask a regex question. — p.s.w.g, Dec 31 '13 at 21:32
If it would work wrong, then fix it :). You can use a negative lookahead to check for "assin", e.g. `\bass(?!assin)` will match in `ass`, `asses`, and `assignments` (also a problem), but not in `assassin`, `assassinate`, `assassination`, etc. — Adam Smith, Dec 31 '13 at 21:34
@A.O. the issue with using that approach is that things like `jackass` and `assmunch` become problematic. It's very hard to do this right. — Adam Smith, Dec 31 '13 at 21:36
Thank you. As to the Regex engine, I myself, don't know. I'm using this program that requires an input and a config file that contains the regex patterns to check. Sorry — Brodie, Dec 31 '13 at 21:37
I have edited your title. Please see, "[Should questions include “tags” in their titles?](http://meta.stackexchange.com/questions/19190/)", where the consensus is "no, they should not". — John Saunders, Dec 31 '13 at 21:38
@adsmith again, you're right. But that wasn't my point. My point was that you CAN account for those cases, so there's no reason that `assassin` should show up as `a**assin` — A.O., Dec 31 '13 at 21:43

score 1 · Accepted Answer · answered Dec 31 '13 at 21:22

There is no flag that you can turn on to simply ignore duplicate characters. However, you can use the 'one or more' quantifier (+) to match one or more occurrences of any character, character class, or group. For example, the pattern `he+l+o` will match all of the following:

helo
heelo
hello
heeeello
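
As a minimal sketch of the pattern above, here it is checked in Python. Python's `re` module is an assumption here, since the asker's engine is unknown, and `matches_whole` is a hypothetical helper name. The extra word `hllo` is added only to show a non-match.

```python
import re

# Pattern from the answer: 'e' and 'l' may repeat; 'h' and 'o' appear exactly once.
pattern = re.compile(r"he+l+o")

def matches_whole(word: str) -> bool:
    """Return True if the entire word matches the pattern."""
    return pattern.fullmatch(word) is not None

for word in ["helo", "heelo", "hello", "heeeello", "hllo"]:
    print(word, matches_whole(word))
```

Only `hllo` fails, because `e+` requires at least one `e`.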

p.s.w.g


you might also want to use '*', which is the same as above, but accounts for 0 frequency as well, just so the OP knows! – A.O. Dec 31 '13 at 21:24
"sh*it" will censor `sit`. Don't do that. – Adam Smith Dec 31 '13 at 21:24
@adsmith obviously, i wasn't saying that he SHOULD use '*' i was just saying that there might be cases where it's needed instead of '+'. so it would catch words like btch, fck, mthrfckr and so onnnnnnnn – A.O. Dec 31 '13 at 21:28
Quantifiers should come *after* the character they modify. So if you want to match any sequence of `h e l o` ignoring repetitions or intervening whitespace, the pattern would be `h+\s*e+\s*l+\s*o+` – p.s.w.g Dec 31 '13 at 21:29
This one worked the best and was quite clear. Thanks – Brodie Dec 31 '13 at 21:38
If you worry about repeated characters anywhere in your string, then you need a `+` everywhere in your string... – Floris Dec 31 '13 at 21:42
@adsmith - I like your example, but AO was talking about `*` in the regex sense of the word, not the `f**k` censoring sense... – Floris Dec 31 '13 at 21:43
@Floris I was pretty clear about that in an earlier comment. My transmogrification tangent was just that -- tangential. I didn't think that using the "zero or more" regex character related at all to replacing the letter with an asterisk – Adam Smith Dec 31 '13 at 21:45
@adsmith Now that's an interesting idea. I think I'll have to check that out. – p.s.w.g Dec 31 '13 at 21:46
One question I have: I'm trying to censor the word dick, and d i c k is censored, but d i i c k is not censored. Can anyone explain to me how to fix this? This is my regex pattern: (?i)d+\s*i+\s*c+\s*k+ – Brodie Dec 31 '13 at 21:48
@Brodie Well in that case you'd have to use something like `(?i)d+(\s*i)+(\s*c)+(\s*k)+` – p.s.w.g Dec 31 '13 at 21:52
@Brodie it expects a "c" after the `d i `. It will catch `d iiii c k` because you've told it to expect one or more `i`s before the space. Try `(?i)(?:d+\s*)+(?:i+\s*)+(?:c+\s*)+(?:k+\s*)+` which encloses each letter and possible space in a non-capturing group and allows them to repeat. This syntax is subject to interpretation by your regex engine -- it may not work. – Adam Smith Dec 31 '13 at 21:54
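
To make the difference between the two pattern shapes in these comments concrete, here is a rough sketch in Python (an assumption, since the asker's engine is unknown). It uses the `h e l o` example from the earlier comment in place of the asker's word, and the names `simple` and `grouped` are hypothetical.

```python
import re

# Shape 1: each letter may repeat, with optional whitespace between *different* letters.
simple = re.compile(r"h+\s*e+\s*l+\s*o+", re.IGNORECASE)

# Shape 2: each (letter-run + optional whitespace) unit may itself repeat,
# so whitespace between copies of the *same* letter is also tolerated.
grouped = re.compile(r"(?:h+\s*)+(?:e+\s*)+(?:l+\s*)+(?:o+\s*)+", re.IGNORECASE)

for text in ["h e l o", "heeello", "h e e l o"]:
    print(repr(text), bool(simple.search(text)), bool(grouped.search(text)))
```

The first shape misses `h e e l o` because after `e+` and the space it expects an `l`, not another `e`. One caution about the second shape: nesting a quantifier inside a repeated group, as in `(?:h+\s*)+`, can backtrack heavily on long inputs that fail to match. The `h+(\s*e)+...` form from the other comment avoids that, since each repetition must consume exactly one letter.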

score 0 · Answer 2 · answered Dec 31 '13 at 21:26

Assuming you want a general solution to remove repeated characters, you'll want to replace (.)\1 with \1 repeatedly as long as it succeeds.
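
A minimal sketch of this idea in Python (the language is an assumption, and `collapse_repeats` is a hypothetical helper name). It uses `(.)\1+` so that one substitution pass collapses each whole run, which has the same effect as replacing `(.)\1` repeatedly until nothing changes.

```python
import re

def collapse_repeats(text: str) -> str:
    """Collapse every run of one repeated character down to a single character."""
    return re.sub(r"(.)\1+", r"\1", text)

print(collapse_repeats("heeeello"))  # helo
print(collapse_repeats("poop"))      # pop
```

Note that legitimate double letters are collapsed too, so any word list compared against the result would need the same treatment.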

Peter Alfvin


And thus the banned word "poop" will be replaced with the OK word "pop", and pass the screen... – Floris Dec 31 '13 at 21:39
Yep. I was thinking of traditional profanity (in English) and couldn't think of a word with consecutive characters. In any event, I think producing word-specific regexes is nuts. – Peter Alfvin Dec 31 '13 at 22:35

score 0 · Answer 3 · answered Dec 31 '13 at 21:29

Use + to catch as many repetitions of a character, or of a sequence in (), as there are. e+ will catch all the e's in heeeeello.

Check out rubular.com, very simple way to learn, practice and test regex.

UKatz


score 0 · Answer 4 · answered Dec 31 '13 at 21:34

You need to capture a single character, then check for any repetition of it using a backreference to the captured group:

(.)\1+

If the string matches, it contains at least one repeated character.
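
As a small sketch in Python (an assumption, as the asker's engine is unknown), the pattern can be used to list every repeated run in a string. `REPEAT` and `find_repeats` are hypothetical names.

```python
import re

# A captured character followed by one or more copies of itself.
REPEAT = re.compile(r"(.)\1+")

def find_repeats(text: str) -> list:
    """Return each run of a repeated character found in text."""
    return [m.group(0) for m in REPEAT.finditer(text)]

print(find_repeats("heeeello"))  # ['eeee', 'll']
print(find_repeats("helo"))      # []
```

This only detects repetition. Ordinary words with double letters, such as `hello`, are reported as well.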

revo


score -1 · Answer 5 · answered Dec 31 '13 at 21:34

The problem is harder than you think. Let's assume that you want to match "no fewer than this number of characters" for each word in your dictionary. Then you would have to create a dictionary of regexes with a + after each character…

Initial dictionary:
boom
smurf
tree
cannibals

Process the dictionary with sed:

sed -e 's/\(.\)/\1+/g' dictionary.txt > regex.txt

This puts a + after every character:

b+o+o+m+
s+m+u+r+f+
t+r+e+e+
c+a+n+n+i+b+a+l+s+

And now you can match your "repeated" words (testing each word as a whole):

bom : no match
smuuurf : match
trees   : no match
canibals : no match
cannnibalssss : match

You might want to add "word boundaries" - so that smurfette doesn't get caught by smurf. This would mean adding \b before and after each expression ("word boundary").
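
The approach can be sketched in Python as follows. The language is an assumption (the asker's engine is unknown), and `build_pattern` and `is_listed` are hypothetical helper names. Each dictionary word gets a `+` after every character and a `\b` on both sides.

```python
import re

def build_pattern(word: str) -> re.Pattern:
    """Turn 'boom' into the compiled regex \\bb+o+o+m+\\b."""
    body = "".join(re.escape(ch) + "+" for ch in word)
    return re.compile(r"\b" + body + r"\b")

dictionary = ["boom", "smurf", "tree", "cannibals"]
patterns = [build_pattern(word) for word in dictionary]

def is_listed(text: str) -> bool:
    """Return True if text contains a stretched form of any dictionary word."""
    return any(p.search(text) for p in patterns)

for word in ["bom", "smuuurf", "trees", "canibals", "cannnibalssss", "smurfette"]:
    print(word, is_listed(word))
```

With the word boundaries in place, `trees` and `smurfette` are not caught, while `smuuurf` and `cannnibalssss` are.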

Note - it's not enough to remove all duplicate letters from both the dictionary and the words to be matched. Otherwise you risk banning "pop" because you had "poop" on your list (and when collapsing "pooop", how would you know to stop at exactly two characters?). This is why I prefer this solution over some of the others that recommend stripping repeats.

I am impressed that I managed to harvest a down vote for this answer. Can anyone explain? — Floris, Dec 31 '13 at 21:44
