I have a dataset of lines in which, due to a code bug, strings are duplicated one or more times. Each line starts with a capital letter and often contains multiple words, and then the string repeats. Some lines are fine and don't contain repeated text. For instance, the data could be:

The quick brown fox jumps over the lazy dogThe quick brown fox jumps over the lazy dog
ApplesApplesApples
IBM AT Computer
Lamp ShadeLamp Shade
OrangesOranges
I am a Potato

I have found multiple regular expressions for finding repeated words that stop at a predefined boundary such as \b or \w - that's pretty easy.

Finding repeating phrases of a fixed length (e.g. two words that repeat, as in i am i am a potato), where there is a built-in boundary condition such as \w, is also relatively easy. I have found examples of that, such as \b(\w+(?:\s*\w*))\s+\1\b (demo https://regex101.com/r/4UIrxu/2). It falls short when there are three repeats, as in i am i am i am a potato: it only finds the first occurrence.
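To make that limitation concrete, here is a minimal Python sketch. The first pattern is the one quoted above. The second, `MANY_REPEATS`, is a hypothetical variant (not taken from the linked demo) that wraps the repeat in a `+` group. The helper name `matched_spans` is also just for illustration.

```python
import re

# Pattern quoted above: a one- or two-word phrase followed by exactly one repeat.
ONE_REPEAT = re.compile(r"\b(\w+(?:\s*\w*))\s+\1\b")

# Hypothetical variant (an assumption, not from the linked demo):
# let the "whitespace + same phrase" part occur one or more times.
MANY_REPEATS = re.compile(r"\b(\w+(?:\s*\w*))(?:\s+\1\b)+")


def matched_spans(pattern, text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in pattern.finditer(text)]


print(matched_spans(ONE_REPEAT, "i am i am a potato"))
print(matched_spans(ONE_REPEAT, "i am i am i am a potato"))
print(matched_spans(MANY_REPEATS, "i am i am i am a potato"))
```

With three repeats, the first pattern consumes two copies and leaves the third behind, while the variant takes all three in one match. Both still depend on whitespace between the copies, so neither helps with data like ApplesApplesApples.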

My phrases contain one or more words so the above phrase matcher won't work.

Is it possible to tell an expression that its boundary is a condition that I make up, like a lowercase letter followed by an uppercase letter (as in the T in dogThe), which I can match with \B[a-z][A-Z]\B, and then use that as a marker to test whether the preceding portion was repeated? I wasn't able to modify the repeating-phrase pattern to use this boundary condition, but maybe it is still possible.

frumbert
  • Does this answer your question? [Regex to find repeating numbers](https://stackoverflow.com/questions/6507982/regex-to-find-repeating-numbers) – Parzh from Ukraine Jun 17 '21 at 06:51
  • no. i need to make my own boundary, not use a built-in one (I think). – frumbert Jun 17 '21 at 07:05
  • Would `^(.*)\1+` help? It [solved](https://regex101.com/r/NqmhY1/1) all given samples. – JvdV Jun 17 '21 at 07:15
  • Your question contradicts itself. You speak of word boundaries, but your examples include `ApplesApplesApples`, which has no word boundaries between repeating terms. Please show example input and indicate what part of it should match, and show examples that should not match. – Bohemian Jun 17 '21 at 07:19

1 Answer

This is very simple, but might provide a start:

/([A-Z].*)\1{1,}/

See https://regex101.com/r/ynfuCO/1
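As a rough illustration, the same pattern can be tried from Python's `re` module against the sample lines in the question. This is only a sketch: the helper name `collapse_repeats` is made up for the example, and replacing the whole match with the captured group is one assumed way of cleaning the data, not something the question specifies.

```python
import re

# The simple pattern from above: a capital letter, then anything,
# followed by one or more copies of that same captured text.
SIMPLE = re.compile(r"([A-Z].*)\1{1,}")


def collapse_repeats(line: str) -> str:
    """Replace each run of repeated text with a single copy (group 1)."""
    return SIMPLE.sub(r"\1", line)


samples = [
    "The quick brown fox jumps over the lazy dogThe quick brown fox jumps over the lazy dog",
    "ApplesApplesApples",
    "IBM AT Computer",
    "Lamp ShadeLamp Shade",
    "OrangesOranges",
    "I am a Potato",
]

for line in samples:
    print(collapse_repeats(line))
```

One caveat with the greedy `.*`: when a string is repeated an even number of times, such as four copies of Apples, the group captures two copies at once, so a single pass leaves ApplesApples. Running the substitution again until the line stops changing works around that.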

This introduces the boundary condition:

/(?:^|(?<=[a-z]))([A-Z].*)\1{1,}/

I've included start-of-line as well as a lowercase/Uppercase boundary, because that seems to match your requirements. See https://regex101.com/r/PBFDPY/2

The (?<=[a-z]) part is a positive look-behind (see e.g. https://www.regular-expressions.info/lookaround.html), which checks that the character just before the capital is a lower-case letter without making it part of the match. You might need to adapt the character classes: I've used [a-z] and [A-Z] for simplicity, but they only cover unaccented ASCII letters, which often isn't adequate.
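To see what the boundary condition changes, here is a small Python comparison of the two patterns. It is a sketch under assumptions: the line "Type AA Battery" is a made-up input that does not appear in the question, and `collapse_with` is a hypothetical helper name.

```python
import re

# Without the boundary: a repeat may start at any capital letter.
SIMPLE = re.compile(r"([A-Z].*)\1{1,}")

# With the boundary: a repeat may only start at the beginning of the line
# or directly after a lower-case letter (checked by the look-behind).
BOUNDED = re.compile(r"(?:^|(?<=[a-z]))([A-Z].*)\1{1,}")


def collapse_with(pattern, line):
    """Replace each repeated run found by `pattern` with a single copy."""
    return pattern.sub(r"\1", line)


# Made-up line (an assumption): "AA" is legitimate data, not a duplicate.
line = "Type AA Battery"
print(collapse_with(SIMPLE, line))   # the unbounded pattern alters the line
print(collapse_with(BOUNDED, line))  # the bounded pattern leaves it alone

# Both patterns still collapse a genuine duplicate at the start of a line.
print(collapse_with(BOUNDED, "Lamp ShadeLamp Shade"))
```

In the made-up line the first A follows a space and the second follows a capital, so neither position satisfies the boundary and the bounded pattern finds nothing to collapse.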

Chris Lear
  • `{1,}` is long hand for `+` – Bohemian Jun 17 '21 at 07:20
  • The positive look behind is the key that I was missing. I was seeing the lowercase-uppercase as the boundary rather than matching the capital-then-anything grouping and trying to put a lowercase negation behind it. My mistake (regex are still hard). – frumbert Jun 17 '21 at 23:17