I am only practicing my regex, so there is no real "question" as such. Please don't suggest other .NET methods that can do this. This is all about me learning, so please don't answer unless it's related to regex. Thank you.

I gave myself the task of matching duplicate words. I first did this by hard-coding the word, but then asked myself: what if we wanted ALL repeated words? What I attempted was to capture the first word in group one, back-reference it, and go from there. I've been struggling all night.

An example text would be "The quick Brown Fox Jump over the Brown fence." As we can see, "The" and "Brown" each appear twice.

Expression:

(?i)(?<=\s*\1\.*)\s+(\w+)

Any tips or advice on where I am going wrong would be great. I have RegexBuddy fired up but am still struggling. I am using VB.NET.

Ňɏssa Pøngjǣrdenlarp
Sam Johnson

2 Answers

What you have used in your code is not a "positive look-ahead"; it is a "look-behind".

I have no experience with VB.NET, but not all regex engines support look-behind with dynamic length (like .*).
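To illustrate that restriction, Python's built-in re module is one engine that rejects variable-width look-behind. The sketch below uses two made-up patterns and a hypothetical helper named compiles; Python is used only because it is easy to run, and the result says nothing about how .NET treats the same patterns.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Illustrative patterns only (not taken from the question).
variable_width = r"(?<=\w+\s)\w+"   # look-behind whose length can vary
fixed_width = r"(?<=abc)def"        # look-behind of exactly three characters

print(compiles(variable_width))  # False: re requires a fixed-width look-behind
print(compiles(fixed_width))     # True
```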

However, your problem could be solved with a positive look-ahead:

(\b\w+\b)(?=.*?\1)

I don't have Windows, so I just tried it with grep's -P (PCRE) and -i (ignore case) options:

kent$ echo "The quick Brown Fox Jump over the Brown fence."|grep -iPo '(\b\w+\b)(?=.*?\1)'  
The
Brown
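One thing to watch with this pattern: the \1 inside the look-ahead is not wrapped in word boundaries, so it can also be satisfied by a later word that merely contains the captured one. Here is a rough sketch in Python's re module, assuming it behaves like PCRE for these constructs. The names loose, strict, and tricky, the second test string, and the \b\1\b variant are my own additions, not part of the grep test above.

```python
import re

# The pattern from above, with (?i) standing in for grep's -i option.
loose = re.compile(r"(?i)(\b\w+\b)(?=.*?\1)")
# Assumed variant: require the later occurrence to be a whole word too.
strict = re.compile(r"(?i)(\b\w+\b)(?=.*?\b\1\b)")

sample = "The quick Brown Fox Jump over the Brown fence."
tricky = "the other day"  # "other" contains "the"

print(loose.findall(sample))   # ['The', 'Brown']
print(strict.findall(sample))  # ['The', 'Brown']
print(loose.findall(tricky))   # ['the']  (satisfied by the "the" inside "other")
print(strict.findall(tricky))  # []
```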
Kent

What you have is actually a look-behind, not a look-ahead. Your approach might still work; however, in .NET, back-references (\1) need to come after the group they reference. It would also help to use word boundaries (\b) rather than testing for whitespace (\s) around the word characters.

At first glance, it seems like you might be able to solve this by putting the capturing group inside the look-behind:

(?i)(?<=\b(\w+)\b.*)\1

However, because of the greedy .* inside the look-behind, the first group will only ever match the first word in the string (here, "The"). So this is effectively equivalent to (?i)\b(\w+)\b.*\1. Making it non-greedy (.*?) will cause it to match only two consecutive instances of the same word.
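As a rough check of the "effectively equivalent" consuming pattern only, here it is run through Python's re module. This is an assumption-laden sketch: Python cannot compile the variable-width look-behind itself, so it shows nothing about how .NET evaluates the look-behind version, only that the plain form reports "The" and never gets to "Brown".

```python
import re

text = "The quick Brown Fox Jump over the Brown fence."

# The consuming form from the paragraph above: no look-around at all.
consuming = re.compile(r"(?i)\b(\w+)\b.*\1")

# findall returns the contents of group 1 for each match.
print(consuming.findall(text))  # ['The']

# The single match runs from "The" to the later "the", so the first
# "Brown" is swallowed and the pair of "Brown"s is never reported.
match = consuming.search(text)
print(match.group(0))  # The quick Brown Fox Jump over the
```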

The solution then is to simply use a look-ahead like this:

(?i)\b(\w+)\b(?=.*\1)

And in case you need to get the second occurrence of the word rather than the first, this can be accomplished by putting a second capture group inside the look-ahead.
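The exact pattern for that is not shown here, so the following is only one possible reading, sketched with Python's re module rather than .NET: wrap the back-reference in a second group, (\1), and read group 2. The lazy .*? is my assumption so that group 2 lands on the nearest later occurrence.

```python
import re

text = "The quick Brown Fox Jump over the Brown fence."

# Group 1: the first occurrence (consumed by the match).
# Group 2: the later occurrence, captured inside the look-ahead.
pattern = re.compile(r"(?i)\b(\w+)\b(?=.*?(\1))")

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]
print(pairs)  # [('The', 'the'), ('Brown', 'Brown')]
```

Because the match is case-insensitive, group 2 holds the text as it actually appears the second time ("the"), which is how you can tell the two captures apart.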

p.s.w.g
  • Full disclosure: [I asked a question](http://stackoverflow.com/questions/18344034/do-backreferences-need-to-come-after-the-group-they-reference) about this very topic a while back, but I don't feel this question is a duplicate because it seems to be asking about a much more general concern. – p.s.w.g May 07 '14 at 22:00