
I tried to get a regex to work but couldn't (probably because I'm fairly new to regex).

Here's what I want to do:

Consider this text: One word, duel. Limes said bye.

Wanted matches: word, duel, Limes, said

As mentioned in the title, I want to match consecutive words where one word ends with a letter (for example, "t") and the next word starts with that same letter, case-insensitively.

The closest I got to the answer is with this expression: `[^a-z][a-z]*([a-z])[^a-z]+\1[a-z]*([a-z])[^a-z]+\2[a-z]*[^a-z]`

nisser
  • You can try this regex `\w+(\w)\W+\1\w+` – Ulugbek Umirov Nov 19 '19 at 21:11
  • Do you consider a "word" a combination of letters, digits, underscores, or just letters/diacritics? – Wiktor Stribiżew Nov 19 '19 at 21:14
  • So, there are two expected matches, right? `word, duel` and `Limes said`? Or four: `["word", "duel", "Limes", "said"]`? – Wiktor Stribiżew Nov 19 '19 at 21:20
  • Consider a word a combination of letters, no digits, underscores. Plain letters. – nisser Nov 19 '19 at 21:33
  • four words expected – nisser Nov 19 '19 at 21:34
  • There was just a very similar question a few days ago: https://stackoverflow.com/q/58895677/5527985 (maybe [this pattern](https://regex101.com/r/Bw24sT/5) is of help for you). – bobble bubble Nov 19 '19 at 21:35
  • @nisser Does a word consist of at least 2 characters? – The fourth bird Nov 19 '19 at 21:40
  • word consists of at least 2 characters. – nisser Nov 19 '19 at 21:51
  • To get words with at least two characters individually in .NET regex you can also do something like [`\b(?:\w+(\w)(?=\W+\1\B)|(?<=\B(\w)\W+)\2\w+)`](http://regexstorm.net/tester?p=%5cb%28%3f%3a%5cw%2b%28%5cw%29%28%3f%3d%5cW%2b%5c1%5cB%29%7c%28%3f%3c%3d%5cB%28%5cw%29%5cW%2b%29%5c2%5cw%2b%29&i=One+word%2c+duel.+Limes+said+bye.&o=i) but there are nice answers already! – bobble bubble Nov 19 '19 at 22:06
  • Of course it is a different question from [this one](https://stackoverflow.com/q/58895677/5527985), just look at the requirements (output is separate word array, any non-word chars, not just spaces, between the words) and the suggested solution is not .NET-ready. – Wiktor Stribiżew Nov 20 '19 at 12:54
  • @nisser Please check if the solution below is working for you and if yes, please consider accepting. Else, let me know what is not working properly. – Wiktor Stribiżew Nov 20 '19 at 12:55

2 Answers


You may use

(?i)\b(?<w>\p{L}+)(?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+\b

See the regex demo. The results are in Group "w" capture collection.

Details

  • \b - a word boundary
  • (?<w>\p{L}+) - Group "w" (word): 1 or more BMP Unicode letters
  • (?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+ - 1 or more repetitions of
    • \P{L}+ - 1 or more chars other than BMP Unicode letters
    • (?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*) - Group "w":
      • (\p{L}) - a letter captured into Group 1
      • (?<=\1\P{L}+\1) - immediately to the left of the current position, there must be the same letter as captured in Group 1, 1+ chars other than letters, and the letter in Group 1
      • \p{L}* - 0 or more letters
  • \b - a word boundary.


C# code demo:

using System;
using System.Linq;
using System.Text.RegularExpressions;

var text = "One word, duel. Limes said bye.";
var pattern = @"\b(?<w>\p{L}+)(?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+\b";
// Every capture of group "w" in the single match is one word of the chain.
var result = Regex.Match(text, pattern, RegexOptions.IgnoreCase).Groups["w"].Captures
        .Cast<Capture>()
        .Select(x => x.Value);
Console.WriteLine(string.Join(", ", result)); // => word, duel, Limes, said

A C# demo version without using LINQ:

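using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string text = "One word, duel. Limes said bye.";
string pattern = @"\b(?<w>\p{L}+)(?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+\b";
Match result = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
List<string> output = new List<string>();
if (result.Success)
{
    // Group "w" is repeated, so its Captures collection holds each chained word.
    foreach (Capture c in result.Groups["w"].Captures)
        output.Add(c.Value);
}
Console.WriteLine(string.Join(", ", output)); // => word, duel, Limes, said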
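Both demos call Regex.Match, which returns only the first match, that is, the first chain of linked words. If the input may contain several separate chains, one possible approach is to loop over Regex.Matches instead. The following is a minimal sketch under that assumption; the class name ChainedWords, the method FindChains and the second sample sentence are made up for illustration and are not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ChainedWords
{
    // Pattern taken unchanged from the answer; group "w" collects the words of one chain.
    private const string Pattern =
        @"\b(?<w>\p{L}+)(?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+\b";

    // Hypothetical helper: returns one list of words per chain found in the text.
    public static List<List<string>> FindChains(string text)
    {
        var chains = new List<List<string>>();
        foreach (Match m in Regex.Matches(text, Pattern, RegexOptions.IgnoreCase))
        {
            var chain = new List<string>();
            foreach (Capture c in m.Groups["w"].Captures)
                chain.Add(c.Value);
            chains.Add(chain);
        }
        return chains;
    }

    public static void Main()
    {
        // Illustrative input with two separate chains.
        foreach (var chain in FindChains("Big gap here. Red door."))
            Console.WriteLine(string.Join(", ", chain));
    }
}
```

For the illustrative input this should print one line per chain (Big, gap and Red, door). No LINQ is used, so it also fits the constraint raised in the comment below.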
Wiktor Stribiżew
  • Forgot to mention, LINQ is not allowed and (correct me if I'm wrong) this solution uses LINQ in x => x.Value. Is there a way around this? – nisser Nov 20 '19 at 14:10

If a word consists of at least 2 characters a-z, you might use 2 capturing groups and an alternation between two lookarounds: a positive lookahead that checks whether the next word starts with the last char of the current word, and a positive lookbehind that checks whether the previous word ended with the first char of the current word.

With case insensitive match enabled:

\b([a-z])[a-z]*([a-z])\b(?:(?=[,.]? \2)|(?<=\1 \1[a-z]+))
  • \b Word boundary
  • ([a-z]) Capture group 1 Match a-z
  • [a-z]* Match 0+ times a-z in between
  • ([a-z]) Capture group 2 Match a-z
  • \b Word boundary
  • (?: Non capturing group
    • (?= Positive lookahead, assert what is on the right is
      • [,.]? \2 an optional . or , space and what is captured in group 2
    • ) Close lookahead
    • | Or
    • (?<= Positive lookbehind, assert what is on the left is
      • \1 \1[a-z]+ Match what is captured in group 1, a space, group 1 again and 1+ times a char a-z
    • ) Close lookbehind
  • ) Close non capturing group

Regex demo
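Since each word is a separate match here, the pattern can be applied in C# with Regex.Matches and no LINQ. This is a minimal sketch, assuming RegexOptions.IgnoreCase stands in for the case-insensitive setting; the class name LinkedWords and the method Find are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkedWords
{
    // Pattern from this answer, unchanged.
    private const string Pattern =
        @"\b([a-z])[a-z]*([a-z])\b(?:(?=[,.]? \2)|(?<=\1 \1[a-z]+))";

    // Hypothetical helper: every match is one word that is linked to a neighbour.
    public static List<string> Find(string text)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(text, Pattern, RegexOptions.IgnoreCase))
            words.Add(m.Value);
        return words;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Find("One word, duel. Limes said bye.")));
        // => word, duel, Limes, said
    }
}
```

One thing to keep in mind: the lookbehind accepts exactly one space between the words, while the lookahead also allows an optional . or , before the space. A word that is linked only to its predecessor across punctuation, such as dog in the made-up input "bad, dog.", is therefore not reported.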

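Note that [a-z], even with case-insensitive matching, is a small range for a word because it covers ASCII letters only. You might use \w or \p{L} instead; keep in mind that \w also matches digits and underscores.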
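As a rough illustration of that substitution, the sketch below swaps [a-z] for \p{L} and replaces both separators (the optional punctuation plus space, and the single space) with \P{L}+, so any run of non-letters may sit between two words. This variant is an assumption made for illustration, not the pattern tested in this answer; the class name UnicodeLinkedWords and the method Find are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UnicodeLinkedWords
{
    // Assumed variant: \p{L} instead of [a-z], \P{L}+ as the separator on both sides.
    private const string Pattern =
        @"\b(\p{L})\p{L}*(\p{L})\b(?:(?=\P{L}+\2)|(?<=\1\P{L}+\1\p{L}+))";

    // Hypothetical helper: every match is one word that is linked to a neighbour.
    public static List<string> Find(string text)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(text, Pattern, RegexOptions.IgnoreCase))
            words.Add(m.Value);
        return words;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Find("One word, duel. Limes said bye.")));
        // => word, duel, Limes, said
    }
}
```

With the wider separator, a word that is linked only backwards across punctuation (for example dog in "bad, dog.") should be reported as well.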


The fourth bird