How can we search for repetitive patterns in a word using regex, in order to detect "junk" or dummy words such as "gfgfgfgfg", without flagging creative words like "aweeesssoome", "omggg", etc.?

Examples:

  1. In the case of "gfgfgfgfg", the regex search / detection should be positive: the base pattern "gf" is detected, and it constructs the entire word (mind the "hanging" final character "g").

  2. In the case of the word "aweesooomee" it should return false, as no repetitive pattern constructs the entire word.
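
To make the two requirements above concrete, here is a minimal non-regex sketch in Python. The function name `is_repeated_pattern` and the parameters `min_base` and `min_repeats` are hypothetical, chosen only for illustration; the thresholds are assumptions, not something stated in the question.

```python
def is_repeated_pattern(word: str, min_base: int = 2, min_repeats: int = 2) -> bool:
    """Return True if the whole word is one base pattern repeated,
    optionally followed by a partial ("hanging") copy of that pattern."""
    n = len(word)
    # The base must fit into the word at least `min_repeats` full times.
    for base_len in range(min_base, n // min_repeats + 1):
        base = word[:base_len]
        # Repeat the base past the end of the word, then cut to length;
        # the cut is what permits a hanging tail such as the final "g".
        if (base * (n // base_len + 1))[:n] == word:
            return True
    return False


print(is_repeated_pattern("gfgfgfgfg"))    # base "gf" plus hanging "g"
print(is_repeated_pattern("aweesooomee"))  # no base builds the whole word
```

With the default `min_repeats=2`, legitimate words such as "mama" would also be flagged; raising it to 3 avoids that, at the cost of letting "blahblah" through.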


Re: possible duplicate flag by rsjaffe:

The question "Detect repetitions in string" has a generic and not so "smart" solution, which is not what I am looking for. The variation I'm currently using produces considerably fewer false positives. The simple test I've posted on regex101.com shows why the linked solution does not satisfy my requirements.

Additional explanation:

The above method also detects repetitions across neighboring words, and it flags creative ("valid") words, neither of which is desirable.

Examples:

"this is" -- detects "is" as a repetition across 2 separate words ("is is" pattern match).

"awesoooommeee" -- detects repetitions of single letters like "o", "m" and "e".


A solution to this proved a bit hard to find, so I'm forced to ask the question.

First, a bit of a background story:

  • I run a blog
  • I have a post about reCaptcha
  • Sometimes (every week or so) someone tries to be funny and posts spam comments similar to these:

gfgfgfgf

sdsdsdsds

dadadada

You get the idea. Whether they are testing an automated reCaptcha bypass system as a proof of concept or just trying to be funny, I don't know and I don't really care (most probably a mix of both).

(edit) Interestingly enough, no other posts are affected by this type of spam comment.

However, thinking about this, it should be relatively simple to detect patterns in the (mostly) single words that those comments consist of (99% of the time) and prevent those comments from being posted. Sounds simple?

But, it must also be good enough to avoid false positives.

If, for example, a comment consists of a single repetitive word like the ones above, then it's definitely spam.

If, on the other hand, it just has a typo in the middle of a normal sentence, it should pass.

Now, I can already 'hear' the comments below: why not use Akismet? Or solution X? Or solution Y? Why not an external comment system like Disqus or Facebook comments? Because I can't. It must be in-house, and I want it to be simple. I already have some things in place that prevent a lot of junk, but for this particular case they all fail.

Solution(s) that I have tested so far:

This is one regex example that is a variant of this answer here, but it's not perfect:

(.+\w)(?=\1+)/gu

see live regex101 example
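
For anyone who wants to reproduce this outside regex101, a rough Python equivalent might look like the sketch below. It assumes Python's `re` module behaves the same as the regex101 flavor for this pattern; the `/g` and `/u` flags are dropped because a single `re.search` on a `str` is enough here. `ADJACENT_REPEAT` and `flags_text` are hypothetical names.

```python
import re

# Same pattern as above: a run of 2+ characters ending in a word character,
# immediately followed by at least one more copy of itself.
ADJACENT_REPEAT = re.compile(r"(.+\w)(?=\1+)")


def flags_text(text: str) -> bool:
    """Return True if the pattern finds any adjacent repetition in the text."""
    return ADJACENT_REPEAT.search(text) is not None


for sample in (
    "gfgfgfgf",
    "Hi I dont think this is a spam.",
    "I loooovve this. It's awesooooommeee!",
):
    print(repr(sample), flags_text(sample))
```

The last sample is the problematic one: it is flagged because "oo" is directly followed by another "oo", even though the word as a whole is not built from a repeating base.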

The problem with it is that it works for most of the examples below, but it triggers false positives, too:

correct/proper detection:

123123123123

daddaddaddad

sadsadasad

sadsadsad

121212121

sasasasasas

sdsdsdsds

dsdsdsdsd

ffffffff

blahblah

ioiooioioioi

popopopopop

Hi I dont think this is a spam.

improper/incorrect detection (false positive):

I loooovve this. It's awesooooommeee!

Now, this is tricky. The filter does exactly what it was instructed to do; however, the "ooovv" and "oooommeee" patterns are not repetitive in the same sense as the first ones listed above ("gfgfgfgf" etc.). The filter detects the "oo" pattern repetition. Yes, that is correct, but it is not what I want to target.

Does anyone have an idea how I can improve this regex detection to be a bit smarter?

Thanks!

dev101
  • Possible duplicate of [Detect repetitions in string](https://stackoverflow.com/questions/9079797/detect-repetitions-in-string) – rsjaffe Aug 18 '18 at 00:20
  • Hi rsjaffe, thanks for contribution. I have already read that question and it is definitely not exactly what I'm asking here. Please read full question of mine to understand. The answer accepted in the possible duplicate question is inferior to the solution I am currently considering as best. Thanks! – dev101 Aug 18 '18 at 00:23
  • I suggest you tighten up your question. As it currently reads, that linked answer seems appropriate. A question can be very interesting (yours is) but too broad for the purposes of Stack Overflow, which is to develop a database of the best questions and answers to help future programmers seeking a solution. See https://meta.stackoverflow.com/questions/258589/breaking-down-too-broad-and-trying-to-understand-it#259857 for more on "too broad" questions. – rsjaffe Aug 18 '18 at 00:32
  • Can you point to the critical part of "too broad"? Is it because I haven't added specific language category (only generic regex) or overall? I tried my best to make it not "too broad" and post as much specifics through examples, but I am willing to make it more specific. In any case, I have some other ideas to approach this problem, but it cannot be solved with a single regex line. Thanks! – dev101 Aug 18 '18 at 00:36
  • The reason why that answer is not appropriate can be seen here: https://regex101.com/r/6BRTng/1 Every "not spam" line / example is falsely detected, and that is not what I need. – dev101 Aug 18 '18 at 00:44
  • I will modify the question to this: How can we restrict repetition detection to the whole words overall, not parts of it? This should limit the scope considerably, so in the case of "gfgfgfgfg" detection should be positive, and in the case of word "aweesooomee" it should be false. – dev101 Aug 18 '18 at 00:54
  • Tightening it down should help. One other suggestion I have is that you either shorten the text section or put the problem statement first and the background after. People tend to miss your point when the key sections are separated. I've had to extensively reedit some of my questions to address issues like these. It's a bit frustrating, but it improves the site for everyone. The key is to regard this site not as a place to get your question answered, but a place to develop great question-answer pairs of interest to other programmers. – rsjaffe Aug 18 '18 at 01:03
  • Thanks, I get it. I have added a direct and streamlined question in the opening, hopefully it will not be closed now. – dev101 Aug 18 '18 at 01:05
  • You might consider repurposing your answer to answer the question that I linked. That would help future people looking for answers. – rsjaffe Aug 18 '18 at 22:44
  • Hi rsjaffe, I have posted a more related question/answer in my Answer below, as it is actually much closer to what I was looking for. I have also edited main question again, to make it as clear as possible, and reference your proposition with explanation why it doesn't cut it for my case. Thanks! – dev101 Aug 19 '18 at 02:03

1 Answer

I finally solved it! And with a single regex line :)

Searching for "regex detect repetitive string", I found the required clues.

This is the question: "Matching on repeated substrings in a regex", and a particular answer there inspired me to find a solution.

The solution is to use capturing groups and a backreference, in a slightly modified version of the regex from that answer, so that both letters and numbers are included:

^([a-z0-9]{2,}).*(\1)$/gumi

Example: https://regex101.com/r/xG40cL/1

Another variation of the above solution is to include single characters, so that words with both even and odd numbers of characters (even and odd symmetry) will also be matched (e.g. "ooo", "iii" etc.):

^([a-z0-9]{1,}).*(\1)$/gumi

Example: https://regex101.com/r/m9aqNk/1

It is still not perfect, but it is definitely better and closer to the ideal case.

Sorry everyone for being such a pain; now I understand the proper terminology I was looking for regarding regex (it's called a backreference).
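
A quick way to see both the strengths and the remaining gaps is to run the first pattern in Python. This is only a sketch: it assumes Python's `re` treats the pattern like regex101 does, with `re.IGNORECASE | re.MULTILINE` standing in for the `i` and `m` flags, and `SAME_START_AND_END` and `is_flagged` are hypothetical names. The "entertainment" sample is my own addition, to illustrate that the pattern really tests "starts and ends with the same 2+ characters" rather than "is built entirely from one repeated base".

```python
import re

# ^(group of 2+ letters/digits) ... (same group again)$ on each line.
SAME_START_AND_END = re.compile(
    r"^([a-z0-9]{2,}).*(\1)$", re.IGNORECASE | re.MULTILINE
)


def is_flagged(text: str) -> bool:
    """Return True if any line starts and ends with the same 2+ characters."""
    return SAME_START_AND_END.search(text) is not None


for sample in (
    "gfgfgfgfg",
    "blahblah",
    "aweesooomee",
    "I loooovve this. It's awesooooommeee!",
    "entertainment",  # starts and ends with "ent", so it is flagged too
):
    print(repr(sample), is_flagged(sample))
```

The `{1,}` variation is looser still: any line that begins and ends with the same letter or digit (for example "that") would match, so it probably only makes sense combined with other checks, such as a single-word-comment rule.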

dev101