2

I have a long string on a text file (DNA sequence, over 20 000 characters) and I'm trying to find the longest sequence in it that is repeated at least three times. What would be the best way to accomplish this?

The only existing topics I can find are for finding repetitions in two or more separate strings, but how does one do it with one long string?

Ibexion
  • 23
  • 4
  • Is "longest sequence" a specific term or is it equivalent to "longest substring"? And is the sequence / substring allowed to overlap? – Stefan Dec 16 '15 at 08:08

3 Answers3

0

You are looking to solve the 'longest repeated substring problem' if I understand correctly: https://en.wikipedia.org/wiki/Longest_repeated_substring_problem

Take a look at http://rubyquiz.com/quiz153.html

This gem might help you solve the problem: https://github.com/luikore/triez

CTRL + F : "Solve the longest common substring problem:"

Laurens
  • 2,420
  • 22
  • 39
0

To be precise you are looking for the longest 3-fold substring.

A k-fold substring of a string s is a repeated substring of s that appears at least k times in s

There was a similar python question a few years ago that would give you a lot of good information. Specifically take a look at Finding the Longest Multiple Repeat. There is a solution on GitHub but it is also in Python.

Community
  • 1
  • 1
Mike S
  • 11,329
  • 6
  • 41
  • 76
0
str = "ababaeabadefgdefaba"  

Case 1: substrings of a given length cannot overlap

n = 3
n.times.
  flat_map { |i| str[i..-1].each_char.each_cons(n).to_a }.
  uniq.
  each_with_object({}) do |a,h|
    r = /#{a.join('')}/
    h[a.join('')] = str.scan(r).size
end.max_by { |_,v| v }
  #=> ["aba", 3]

Case 2: substrings of a given length can overlap

It is only necessary to change the line defining the regex:

n = 3
n.times.
  flat_map { |i| str[i..-1].each_char.each_cons(n).to_a }.
  uniq.
  each_with_object({}) do |a,h|
    r = /#{a.first}(?=#{a.drop(1).join('')})/
    h[a.join('')] = str.scan(r).size
  end.max_by { |_,v| v }
  #=> ["aba", 4]

Consider the steps performed in Case 2:

n = 3
b = n.times
  #=> #<Enumerator: 3:times> 
c = b.flat_map { |i| str[i..-1].each_char.each_cons(n).to_a }
  #=> [["a", "b", "a"], ["b", "a", "b"], ["a", "b", "a"], ["b", "a", "e"],
  #    ["a", "e", "a"], ["e", "a", "b"], ["a", "b", "a"], ["b", "a", "d"],
  #    ["a", "d", "e"], ["d", "e", "f"], ["e", "f", "g"], ["f", "g", "d"],
  #    ["g", "d", "e"], ["d", "e", "f"], ["e", "f", "a"], ["f", "a", "b"],
  #    ["a", "b", "a"], ["b", "a", "b"], ["a", "b", "a"], ["b", "a", "e"],
  #    ["a", "e", "a"], ["e", "a", "b"], ["a", "b", "a"], ["b", "a", "d"],
  #    ["a", "d", "e"], ["d", "e", "f"], ["e", "f", "g"], ["f", "g", "d"],
  #    ["g", "d", "e"], ["d", "e", "f"], ["e", "f", "a"], ["f", "a", "b"],
  #    ["a", "b", "a"], ["a", "b", "a"], ["b", "a", "e"], ["a", "e", "a"],
  #    ["e", "a", "b"], ["a", "b", "a"], ["b", "a", "d"], ["a", "d", "e"],
  #    ["d", "e", "f"], ["e", "f", "g"], ["f", "g", "d"], ["g", "d", "e"],
  #    ["d", "e", "f"], ["e", "f", "a"], ["f", "a", "b"], ["a", "b", "a"]]
d = c.uniq
  #=> [["a", "b", "a"], ["b", "a", "b"], ["b", "a", "e"], ["a", "e", "a"],
  #    ["e", "a", "b"], ["b", "a", "d"], ["a", "d", "e"], ["d", "e", "f"], 
  #    ["e", "f", "g"], ["f", "g", "d"], ["g", "d", "e"], ["e", "f", "a"],
  #    ["f", "a", "b"]] 
e = d.each_with_object({}) do |a,h|
      r = /#{a.first}(?=#{a.drop(1).join('')})/
      puts "  str.scan(#{r.inspect}) = #{str.scan(r)}" if a == d.first
      h[a.join('')] = str.scan(r).size
      puts "  h[#{a.join('')}] = #{h[a.join('')]}" if a == d.first
    end
  #=>   str.scan(/a(?=ba)/) = ["a", "a", "a", "a"]
  #=>   h[aba] = 4
  #=> {"aba"=>4, "bab"=>1, "bae"=>1, "aea"=>1, "eab"=>1, "bad"=>1, "ade"=>1,
  #    "def"=>2, "efg"=>1, "fgd"=>1, "gde"=>1, "efa"=>1, "fab"=>1}
e.max_by { |_,v| v }
  #=> ["aba", 4] 

In computing e, for the first element of d passed to the block, the block variable a equals ["a", "b", "a"] and the regex /a(?=ba)/ in str.scan(/a(?=ba)/) matches every a in str that is followed by ba. (?=ba) is a positive lookahead (not part of the match).

Cary Swoveland
  • 106,649
  • 6
  • 63
  • 100