How can I use regex to search for repetitive patterns in words, in order to detect "junk" or dummy words such as "gfgfgfgfg", without also blocking creative words like "aweeesssoome", "omggg", etc.?
Examples:
In the case of "gfgfgfgfg" the regex search should be positive: the base pattern "gf" is detected and it constructs the entire word (mind the "hanging" final character "g").
In the case of the word "aweesooomee" the result should be negative, as no repetitive pattern constructs the entire word.
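To pin down the behavior I expect, here is a rough sketch in plain Python rather than regex. The function name and the two thresholds (a unit of at least 2 characters, repeated at least twice) are my own assumptions for illustration, not an existing solution:

```python
def is_repetitive_word(word: str, min_unit: int = 2, min_repeats: int = 2) -> bool:
    """Return True if the whole word is one short unit repeated over and over.

    A partial final repetition is allowed, e.g. the hanging "g" in "gfgfgfgfg".
    min_unit and min_repeats are assumed thresholds, chosen for illustration.
    """
    length = len(word)
    # Try every candidate unit length that leaves room for min_repeats full copies.
    for n in range(min_unit, length // min_repeats + 1):
        unit = word[:n]
        # The word is built from the unit if every character matches
        # the unit at the same offset (modulo the unit length).
        if all(word[i] == unit[i % n] for i in range(length)):
            return True
    return False


print(is_repetitive_word("gfgfgfgfg"))    # True: "gf" repeated, plus hanging "g"
print(is_repetitive_word("aweesooomee"))  # False: no unit builds the whole word
print(is_repetitive_word("omggg"))        # False
```

Because the unit must be at least 2 characters long and must account for the entire word, stretched letters inside an otherwise normal word do not trigger it.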
re possible duplicate mark by rsjaffe:
The question "Detect repetitions in string" has a generic solution, not the "smart" one I am looking for. As explained above, the variation I'm currently using considerably reduces false positives. A simple test in the link I've posted on regex101.com shows why that solution does not satisfy my requirements.
Additional explanation:
The method from that question also detects repetitions across neighboring words and blocks creative ("valid") words, neither of which is desirable.
Examples:
"this is" -- detects "is" as repetitions in 2 separate words ("is is" pattern match).
"awesoooommeee" -- detects repetitions of single letters like "o", "m" and "e".
A solution for this proved hard to find by searching, so I'm asking the question here.
First, a bit of a background story:
- I run a blog
- I have a post about reCaptcha
- Sometimes (every week or so) someone tries to be funny and posts spam comments in a form similar to this:
gfgfgfgf
sdsdsdsds
dadadada
You get the idea. Whether they are testing an automated reCaptcha bypass system as a proof of concept or just trying to be funny, I don't know and I don't really care (most probably a mix of both).
(edit) Interestingly enough, no other posts are affected by this type of spam comment.
However, thinking about this, it should be relatively simple to detect the patterns in those comments, which (99% of the time) consist of a single word, and prevent them from being posted. Sounds simple?
But, it must also be good enough to avoid false positives.
If, for example, a comment consists of a single repetitive word like the ones above, then it's definitely spam.
If, on the other hand, it just has a typo in the middle of a normal sentence, it should pass.
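The comment-level rule could look roughly like the sketch below. Both function names are hypothetical, and treating a comment as junk only when every word in it is repetitive is my assumption about how to keep normal sentences safe:

```python
import re


def word_is_repetitive(word: str) -> bool:
    # Hypothetical stand-in for the word-level test: True when the word equals
    # a prefix of 2+ characters repeated at least twice, possibly cut off
    # in the middle of the last repetition.
    size = len(word)
    return any(
        (word[:n] * (size // n + 1))[:size] == word
        for n in range(2, size // 2 + 1)
    )


def looks_like_junk_comment(comment: str) -> bool:
    # Split on word characters so punctuation does not interfere.
    words = re.findall(r"\w+", comment.lower())
    # Flag only when the comment is non-empty and made up entirely of
    # repetitive words; one odd word in a normal sentence passes.
    return bool(words) and all(word_is_repetitive(w) for w in words)


print(looks_like_junk_comment("sdsdsdsds"))
print(looks_like_junk_comment("I loooovve this. It's awesooooommeee!"))
```

Checking word by word also means a repetition can never span two neighboring words.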
Now, I can already 'hear' the comments below asking why not use Akismet. Or solution X. Or solution Y. Why not an external comment system like Disqus or Facebook comments. Because I can't. It must be in-house. And I want it to be simple. I already have some things in place that prevent a lot of junk, but for this particular case they all fail.
Solution(s) that I have tested so far:
This is one regex example that is a variant of this answer here, but it's not perfect:
/(.+\w)(?=\1+)/gu
The problem with it is that it handles most of the examples below correctly, but it also triggers false positives:
correct/proper detection:
123123123123
daddaddaddad
sadsadasad
sadsadsad
121212121
sasasasasas
sdsdsdsds
dsdsdsdsd
ffffffff
blahblah
ioiooioioioi
popopopopop
Hi I dont think this is a spam.
improper/incorrect detection (false positive):
I loooovve this. It's awesooooommeee!
Now, this is tricky. The filter does exactly what it was instructed to do; however, the "ooovv" and "oooommeee" patterns are not repetitive in the same sense as the first ones listed above ("gfgfgfgf" etc.). The filter detects the "oo" pattern repeating. Yes, that is correct, but it is not what I want to target.
Does anyone have an idea how I can make this regex detection a bit smarter?
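For anyone who wants to reproduce this without regex101, here is a small Python port of the pattern. The port itself is an assumption on my part: Python has no `g` or `u` flags, but `str` patterns are Unicode-aware by default and `search` returns the first match. The helper name is made up for this example:

```python
import re

# Python port of /(.+\w)(?=\1+)/gu (assumed equivalent for these inputs).
CURRENT_FILTER = re.compile(r"(.+\w)(?=\1+)")


def first_repeat(text: str):
    """Return the first repeated chunk the current filter finds, or None."""
    match = CURRENT_FILTER.search(text)
    return match.group(1) if match else None


print(first_repeat("gfgfgfgf"))                               # 'gfgf' (wanted)
print(first_repeat("I loooovve this. It's awesooooommeee!"))  # 'oo' (false positive)
print(first_repeat("Hi I dont think this is a spam."))        # None (correctly ignored)
```

The pattern fires as soon as any chunk of two or more characters is immediately followed by a copy of itself, anywhere in the text, which is why "oooo" inside "loooovve" is enough to trigger it.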
Thanks!