1

I would like to select the parts of a string covered by a set of substrings with the following properties:

  • They belong in the original string.
  • They may have different lengths and positions.
  • They can overlap.
  • They may not be ordered as they appear in the original string.

For example:

string = "MGLSDGEWQQVLNVWGKVEADIAGHGQEVLIHSKHPGDFGADAQGAMTKALELFRNDIAAKYKELGFQG"
substring1 = "HPGDFGADAQGAMTKALELFR"
substring2 = "GEWQQVLNVWGK"
substringn = "ALELFRNDIAAKYK"

And I would like to get:

coverage = "MGLSD<b>GEWQQVLNVWGK</b>VEADIAGHGQEVLIHSK<b>HPGDFGADAQGAMTKALELFRNDIAAKYK</b>ELGFQG"

I tried to extract the positions of the substrings within the string like this:

substrings_array.each do |substring|
  start_pos = string.index substring
  end_pos = string.length - (string.reverse.index(substring.reverse) )
end

and that, way I get a start and an end position for each substring. How could I merge them all, especially considering they may overlap and appear in different orders? Is this even a good strategy?

sawa
  • 165,429
  • 45
  • 277
  • 381
Vital V
  • 235
  • 1
  • 4
  • 15
  • 3
    This looks like good starting strategy, especially if you turn your `.each` to a `.map` and return a list of start/end positions that need to be marked. Then your problem becomes "How do I merge these ranges so that overlapped ones combine into single larger ones". – Neil Slater Feb 13 '14 at 12:25
  • I would say you are 80% through so what you want to do is at the start_pos insert `` and at the end_pos insert `<\b>` and that seems job done, you can `clone` the string and modify the cloned string, then return the cloned string as your answer, say `new_string = string`, then after getting your `start_pos` and `end_pos` you want to insert the tags in the `new_string` the ruby doc for insert is http://ruby-doc.org/core-2.0/String.html#method-i-insert – bjhaid Feb 13 '14 at 12:34

1 Answers1

1

This should work (not pretty, but it works):

string = "MGLSDGEWQQVLNVWGKVEADIAGHGQEVLIHSKHPGDFGADAQGAMTKALELFRNDIAAKYKELGFQG"
substring1 = "HPGDFGADAQGAMTKALELFR"
substring2 = "GEWQQVLNVWGK"
substring3 = "ALELFRNDIAAKYK"

substrings = [substring1, substring2, substring3]

overlapping_indexes = substrings.map do |substring|
  start_pos = string.index substring
  end_pos = start_pos + substring.length
  (start_pos..end_pos)
end

# the following 3 methods are from Wayne Conrad in this question: http://stackoverflow.com/questions/6017523/how-to-combine-overlapping-time-ranges-time-ranges-union

def ranges_overlap?(a, b)
  a.include?(b.begin) || b.include?(a.begin)
end

def merge_ranges(a, b)
  [a.begin, b.begin].min..[a.end, b.end].max
end

def merge_overlapping_ranges(ranges)
  ranges.sort_by(&:begin).inject([]) do |ranges, range|
    if !ranges.empty? && ranges_overlap?(ranges.last, range)
      ranges[0...-1] + [merge_ranges(ranges.last, range)]
    else
      ranges + [range]
    end
  end
end

indexes = merge_overlapping_ranges(overlapping_indexes)

x = "<b>"
y = "</b>"
offset = 0

indexes.each do |index|
  string.insert(index.begin + offset, x)
  offset += x.length
  string.insert(index.end + offset, y)
  offset += y.length
end

p string
Powers
  • 18,150
  • 10
  • 103
  • 108
  • This will not work. An inserted tag will destroy the original string, avoiding further match with another substring that overlaps with the current substring. – sawa Feb 13 '14 at 13:00