2

I'm trying to extract all possible 3-letter combinations from a String that follow the pattern XYX.

val text = "abaca dedfd ghgig"
val p = """([a-z])(?!\1)[a-z]\1""".r
p.findAllIn(text).toArray

When I run the script I get:

aba, ded, ghg

And it should be:

aba, aca, ded, dfd, ghg, gig

It does not detect overlapping combinations.

Oscar H
  • Are you sure that you need any 3 letter combinations? You want the 3rd letter to be the same as the first one, from what I see. – Wiktor Stribiżew Dec 20 '16 at 13:45
  • The Scaladoc for Regex says to see the doc on findAllIn for example of overlapping matches. http://www.scala-lang.org/api/current/scala/util/matching/Regex.html#findAllIn(source:CharSequence):scala.util.matching.Regex.MatchIterator – som-snytt Dec 20 '16 at 19:40

2 Answers

3

The trick is to enclose the whole pattern in a lookahead, so that each match consumes no characters and the search resumes from the very next position:

val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r
p.findAllIn(text).matchData foreach {
   m => println(m.group(1))
}

The lookahead is only an assertion (a test) at the current position, and the pattern inside it doesn't consume characters. Since the whole match is therefore empty, the result you are looking for has to be read from the first capture group.
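
To make the "empty match" point concrete, here is a minimal sketch that lists, for each match, its start offset, the text of the whole match, and the text of group 1. The names `LookaheadDemo` and `inspect` are made up for this illustration; only `findAllMatchIn`, `start`, `matched` and `group` come from the Scala library.

```scala
import scala.util.matching.Regex

object LookaheadDemo {
  // Same pattern as above: group 1 is the whole XYX triple, group 2 is the letter X.
  val p: Regex = """(?=(([a-z])(?!\2)[a-z]\2))""".r

  // For every match: (start offset, text of the whole match, text of group 1).
  // `inspect` is a hypothetical helper, not part of the original answer.
  def inspect(s: String): List[(Int, String, String)] =
    p.findAllMatchIn(s).map(m => (m.start, m.matched, m.group(1))).toList

  def main(args: Array[String]): Unit =
    inspect("abaca").foreach(println)
}
```

For "abaca" the whole match is the empty string at offsets 0 and 2, while group 1 holds "aba" and "aca". Because nothing is consumed, the second triple can start inside the first one.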

Casimir et Hippolyte
  • This is it, thanks. It works perfect. Regex is amazing, looks like black magic to me. I still have problems getting my head around this concept. When I try to use this pattern in some tools like Sublime Text search, or http://regexr.com/ it does not work. So I have to get better at understanding this. ps: I needed this to help me solve Day 7 of advent of code. http://adventofcode.com/2016/day/7 – Oscar H Dec 20 '16 at 23:14
  • @oxcarh: You must understand that a lookahead assertion is only a test and doesn't consume characters (in other words, the whole match is always an empty string). It works well with sublime text (I just tested) but gives an error with regexr.com (that is probably buggy). – Casimir et Hippolyte Dec 20 '16 at 23:34
2

You need to capture the whole pattern and put it inside a positive lookahead. The code in Scala will be the following:

object Main extends App {
    val text = "abaca dedfd ghgig"
    val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r
    val allMatches = p.findAllMatchIn(text).map(_.group(1))
    println(allMatches.mkString(", "))
    // => aba, aca, ded, dfd, ghg, gig
}

See the online Scala demo

Note that the backreference becomes \2: wrapping the whole pattern in an extra capturing group shifts the letter group to ID = 2, while Group 1 contains the value you need to collect.
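
As a small sketch of that numbering (groups are counted by their opening parenthesis, left to right), the snippet below returns both groups for each match. `GroupNumbering` and `groups` are names invented for this example.

```scala
object GroupNumbering {
  // Group 1 = the outer capture (the XYX triple), group 2 = the inner ([a-z]) (the letter X).
  val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r

  // Hypothetical helper: pairs of (group 1, group 2) for every overlapping match.
  def groups(s: String): List[(String, String)] =
    p.findAllMatchIn(s).map(m => (m.group(1), m.group(2))).toList

  def main(args: Array[String]): Unit =
    groups("dedfd").foreach { case (triple, letter) =>
      println(s"group 1 = $triple, group 2 = $letter")
    }
}
```

For "dedfd" this gives ("ded", "d") and ("dfd", "d"), which is why the value to collect is read from group 1 and the backreference has to point at group 2.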

Wiktor Stribiżew
  • I have read your comments below CH's answer, and must tell you that regexr.com only supports plain JS regex without actually suggesting *any* workarounds for known limitations. One of those limitations is the inability to collect zero-width matches, even if they contain non-empty captured groups. See [Regexr.com shows my regex can match 0 characters, and therefore matches infinitely](http://stackoverflow.com/questions/34495675/zero-length-regexes-and-infinite-matches/34495840#34495840). To use it there, just add a `.` at the end: [`/(?=(([a-z])(?!\2)[a-z]\2))./g`](http://regexr.com/3etul). – Wiktor Stribiżew Dec 21 '16 at 07:35
  • If you explain what you need to do with the string and regex in Sublime Text, I might be able to help. Do you want to create the lists of those overlapping texts and remove all the text around? – Wiktor Stribiżew Dec 21 '16 at 07:36
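
For reference, the trailing `.` variant mentioned in the comment above can also be tried in Scala. This is only a sketch, assuming the pattern behaves under Scala's (Java's) regex engine the way the comment describes for regexr.com; `ConsumingVariant`, `triples` and `matchedText` are names made up here.

```scala
object ConsumingVariant {
  // Lookahead pattern with a trailing `.` so that every match consumes one character.
  val p = """(?=(([a-z])(?!\2)[a-z]\2)).""".r

  // Group 1 still holds the overlapping XYX triples.
  def triples(s: String): List[String] =
    p.findAllMatchIn(s).map(_.group(1)).toList

  // The whole match is now the single character consumed by `.`.
  def matchedText(s: String): List[String] =
    p.findAllMatchIn(s).map(_.matched).toList

  def main(args: Array[String]): Unit =
    println(triples("abaca dedfd ghgig").mkString(", "))
}
```

The captured triples are the same as with the zero-width version; the only difference is that each whole match is one character long instead of empty, which is what tools that reject zero-width matches need.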