Idea. Given a string and a query, find all matches of the query (including overlapping ones) and, for each match, return the text from the start of the string through the end of that match.

Example. For the text atatgcgcatatat and the query atat there are three matches, and the desired output is atat, atatgcgcatat and atatgcgcatatat.

Problem. I use Ruby 2.2 and String#scan method to get multiple matches. I've tried to use lookahead, but the regex /(?=(.*?atat))/ returns every substring that ends with atat. There must be some regex magic to solve this problem, but I can't figure out the right spell.
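
To make the failure concrete, here is a small sketch of what the attempted lookahead produces on the example text. The variable names are only for illustration. Because the lookahead is tried at every position, each result starts where the scan currently stands rather than at the beginning of the string:

```ruby
text = "atatgcgcatatat"

# At each position the lazy .*? stretches only as far as the nearest "atat",
# so every start position that still has an "atat" ahead of it yields a result.
attempt = text.scan(/(?=(.*?atat))/).flatten

puts attempt.length          # far more than the three desired strings
puts attempt.first(2).inspect
```

Every entry ends with atat, but only the first one is anchored at the start of the text, which is why this pattern alone does not give the desired output.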

Daniel

4 Answers


I believe this is at least better than the OP's answer:

text = "atatgcgcatatat"
query = "atat"

res = []
text.scan(/(?=#{query})/) { res.push($` + query) }
res # => ["atat", "atatgcgcatat", "atatgcgcatatat"]
sawa
  • It definitely is! I implemented other ideas in Ruby too and added them to my answer. – Daniel Sep 11 '15 at 13:08
  • I guess there's no need for an intermediate array: `text.to_enum(:scan, /(?<=atat)/).map { $` }` – Daniel Sep 11 '15 at 13:11
  • You could use `tap` as I did: `[].tap { |a| text.scan(/(?=#{query})/) { a << $` + query } } #=> ["atat", "atatgcgcatat", "atatgcgcatatat"]`. – Cary Swoveland Sep 11 '15 at 19:16

A single regex scan cannot return those strings directly: once part of the text has been consumed by one match, it cannot be included in another match, and all the strings you want start at the beginning of the input. Therefore, the best option that I can think of is to use a look-behind to find the ending position of each match:

(?<=atat)

With your example input of atatgcgcatatat, that would return the following three matches:

  • Position 4, Length 0
  • Position 12, Length 0
  • Position 14, Length 0

You could then loop through those results, get the position for each one, and then get the sub-string that starts at the beginning of the input string and ends at that position. If you don't know how to get the positions of each match, you may find the answers to this question helpful.
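
In Ruby, the loop described above could look roughly like this. It is a minimal sketch, not the only way to do it: the variable names are made up for illustration, and `Regexp.escape` is added on the assumption that the query should be treated as literal text.

```ruby
text  = "atatgcgcatatat"
query = "atat"

# Zero-width look-behind: matches the empty string right after each occurrence.
pattern = /(?<=#{Regexp.escape(query)})/

# Collect the position of every (empty) match; $~ is the current MatchData.
end_positions = []
text.scan(pattern) { end_positions << $~.begin(0) }

# Slice from the beginning of the input up to each position.
prefixes = end_positions.map { |pos| text[0, pos] }

p end_positions
p prefixes
```

The positions collected this way are the same ones listed above (4, 12 and 14), so the slices are the three desired strings.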

Steven Doggart
  • But how do I get multiple matches? `"atatgcgcatatat".scan /.*atat/ #=> ["atatgcgcatatat"]` – Daniel Sep 11 '15 at 12:12
  • Thank you, that was helpful! Since there's no option to get the output in just one go, I've also found a way to get the result, and it's slightly different from what you've proposed. I'll post it as a separate answer. – Daniel Sep 11 '15 at 12:36
  • Thanks, @Nakilon, that was useful! – Daniel Sep 11 '15 at 13:00

You could do this:

str = 'atatgcgcatatat'
target = 'atat'

[].tap do |a|
  str.gsub(/(?=#{target})/) { a << str[0, $~.end(0)+target.size] }
end
  #=> ["atat", "atatgcgcatat", "atatgcgcatatat"]

Notice that the string returned by gsub is discarded.
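
One caveat when the target is interpolated straight into the pattern: any regex metacharacters in it (such as `.`) are interpreted rather than matched literally. Here is a hedged sketch of the same approach with the target escaped, assuming it is meant as plain text. The method name `prefixes_through_matches` is hypothetical.

```ruby
# Hypothetical helper: same gsub/tap idea, with the target escaped
# so that it is always matched as a literal string.
def prefixes_through_matches(str, target)
  pattern = /(?=#{Regexp.escape(target)})/
  [].tap do |acc|
    # The zero-width match ends where the target starts, so adding
    # target.size gives the length of the prefix through the match.
    str.gsub(pattern) { acc << str[0, $~.end(0) + target.size] }
  end
end

p prefixes_through_matches('atatgcgcatatat', 'atat')
p prefixes_through_matches('axa.a', 'a.a')
```

In the second call, an unescaped `a.a` would also match `axa` at the start of the string; with escaping, only the literal `a.a` is found.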

Cary Swoveland

It seems there's no way to solve the problem in just one go.

One possible solution is to use this knowledge to get indices of matches when using String#scan, and then return the array of sliced strings:

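def find_by_end(text, query)
  res = []
  n = query.length
  # Zero-width lookahead, so overlapping occurrences are all found.
  text.scan(/(?=#{query})/) do
    # $~.offset(0).first is the start index of the current match.
    res << text.slice(0, $~.offset(0).first + n)
  end
  res
end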

find_by_end "atatgcgcatatat", "atat" #=> ["atat", "atatgcgcatat", "atatgcgcatatat"]
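
For comparison, the same result can be had without a regex at all, by repeatedly calling `String#index` with a start offset. This is a different technique from the one above, sketched here only as an alternative. The method name `prefixes_by_index` is made up, and returning an empty array for an empty query is an assumption.

```ruby
# Hypothetical regex-free variant: walk the string with String#index.
def prefixes_by_index(text, query)
  return [] if query.empty? # assumption: an empty query has no meaningful matches

  result = []
  start = 0
  while (pos = text.index(query, start))
    result << text[0, pos + query.length]
    start = pos + 1 # advance by one character so overlapping matches are found
  end
  result
end

p prefixes_by_index("atatgcgcatatat", "atat")
```

Since the query is searched for as a plain string, there is no need to escape it.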

A slightly different solution was proposed by @StevenDoggart. Here's a short snippet that uses this hack to solve the problem:

"atatgcatatat".to_enum(:scan, /(?<=atat)/).map { $` }
#=> ["atat", "atatgcatat", "atatgcatatat"]

As @CasimiretHippolyte notes, reversing the string might help to solve the problem. It actually does, but it's hardly the prettiest solution:

"atatgcatatat".reverse.scan(/(?=(tata.*))/).flatten.map(&:reverse).reverse
#=> ["atat", "atatgcatat", "atatgcatatat"]
Daniel