How can I find the best fit subsequences of a large string?

Question

Say I have one large string and an array of substrings that when joined equal the large string (with small differences).

For example (note the subtle differences between the strings):

large_str = "hello, this is a long string, that may be made up of multiple
 substrings that approximately match the original string"

sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
 "subsrings tat aproimately ", "match the orginal strng"]

How can I best align the strings to produce a new set of sub strings from the original large_str? For example:

["hello, this is a long string", ", that may be made up of multiple",
 "substrings that approximately ", "match the original string"]

Additional Info

The use case for this is to find the page breaks of the original text from the existing page breaks of text extracted from a PDF document. Text extracted from the PDF is OCR'd and has small errors compared to the original text, but the original text does not have page breaks. The goal is to accurately page break the original text avoiding the OCR errors of the PDF text.

That might be a complicated task. At least I am not aware of any simple way to compare sections of a string. You might be able to compare sections of a string using a percentage to justify accuracy by comparing each character to a section of the large_str and see how many characters match consecutively — Mike - SMT, Aug 31 '17 at 21:18
Complicated to split the large string to compare individual sub-strings. But if you manage to do this, you can use Levenshtein distance to compare them. See https://en.wikipedia.org/wiki/Levenshtein_distance — Xvolks, Aug 31 '17 at 21:32
One way I could think of is based on page segmentation algorithm (also known as word wrap problem). Generally, for page segmentation, we have a function defined that calculates the cost of splitting up a text. But that function in this algorithm is based on number of whitespaces that occur in the text. I think we can have a similar approach but rather than having our split function defined on the base of whitespaces, we can have it designed based on similarity of strings combined with whitespaces. This can one of the approaches to start with and build the solution efficiently. — CodeHunter, Sep 06 '17 at 19:20

Anton · Accepted Answer · 2017-09-01T22:19:11.390

Concatenate the substrings
Align the concatenation with the original string
Keep track of which positions in the original string are aligned with the boundaries between the substrings
Split the original string on the positions aligned with those boundaries

An implementation using Python's difflib:

from difflib import SequenceMatcher
from itertools import accumulate

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"

sub_strs = [
  "hello, ths is a lng strin",
  ", that ay be mad up of multiple",
  "subsrings tat aproimately ",
  "match the orginal strng"]

sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))

sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk = False)

match_index = 0
matches = [''] * len(sub_strs)

for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
  if tag == 'delete' or tag == 'insert' or tag == 'replace':
    matches[match_index] += large_str[i1:i2]
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      j1 += submatch_len
  else:
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      matches[match_index] += large_str[i1:i1+submatch_len]
      j1 += submatch_len
      i1 += submatch_len

print(matches)

Output:

['hello, this is a long string', 
 ', that may be made up of multiple ', 
 'substrings that approximately ', 
 'match the original string']

This seems to work well for the case that the substrings are shorter than the substrings produced by the original large string, but if they are longer the output will duplicate parts of the string. For example, `"hello, ths is a lng strinzzzzz"` becomes `"hello, this is a long string, th"`. — Josh Voigts, Sep 01 '17 at 14:43
@JoshVoigts Yes, I didn't take into account that in a `replace` opcode the lengths of the source string and the target string may be different. I updated the answer to address that.. — Anton, Sep 01 '17 at 22:17

igrinis · Answer 2 · 2017-09-06T14:32:39.197

You are trying to solve sequence alignment problem. In your case it is a "local" sequence alignment. It can be solved with Smith-Waterman approach. One possible implementation is here. If you run it, you receive:

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng sin", ", that ay be md up of mulple", "susrings tat aproimately ", "manbvch the orhjgnal strng"]

for sbs in sub_strs:
    water(large_str, sbs)


 >>>

Identity = 85.185 percent
Score = 210
hello, this is a long strin
hello, th s is a l ng s  in
hello, th-s is a l-ng s--in

Identity = 84.848 percent
Score = 255
, that may be made up of multiple
, that  ay be m d  up of mul  ple
, that -ay be m-d- up of mul--ple

Identity = 83.333 percent
Score = 225
substrings that approximately 
su s rings t at a pro imately 
su-s-rings t-at a-pro-imately 

Identity = 75.000 percent
Score = 175
ma--tch the or-iginal string
ma   ch the or  g nal str ng
manbvch the orhjg-nal str-ng

The middle line shows matching characters. If you need the positions, look for max_i value to get ending position in original string. The starting position will be the value of i at the end of water() function.

I don't think this will work well if `large_str` or `sub_strs` contain much repetition, because this doesn't require that each `sub_str` be used exactly once, in order, etc. Consider: `large_str = "hi ha he"; sub_strs = ["hi", "hi", "hi"];` which will align each `sub_str` starting at index 0. — Alex Varga, Sep 07 '17 at 17:14
@AlexVarga no algorithm can guess what you have on your mind. If you expect `["hi", "hi", "hi"]` to be matched to `"hi ha he"`, than you should concatenate the substrings prior matching, or do iterative matching, cutting out newly found match out of the original string. Otherwise, I see no reason how first `"hi"` is different from the second and third, and why second `"hi"`should be matched differently from the first one. — igrinis, Sep 11 '17 at 05:46

mcdowella · Answer 3 · 2017-09-02T04:18:30.843

(The additional info makes a lot of the following unnecessary. It was written for a situation where the substrings provided might be any permutation of the order in which they occur in the main string)

There will be a dynamic programming solution for a problem very close to this. In the dynamic programming algorithm that gives you edit distance, the state of the dynamic program is (a, b) where a is the offset into the first string and b is the offset into the second string. For each pair (a, b) you work out the smallest possible edit distance that matches the first a characters of the first string with the first b characters of the second string, working out (a, b) from (a-1, b-1), (a-1, b), and (a, b-1).

You can now write a similar algorithm with state (a, n, m, b) where a is the total number of characters consumed by substrings so far, n is the index of the current substring, m is the position within the current substring, and b is the number of characters matched in the second string. This solves the problem of matching b against a string composed by pasting together any number of copies of any of the available substrings.

This is a different problem, because if you are trying to reconstitute a long string from fragments, you might get a solution that uses the same fragment more than once, but if you are doing this you might hope that the answer is obvious enough that the collection of substrings it produces happens to be a permutation of the collection given to it.

Because the edit distance returned by this method will always be at least as good as the best edit distance when you force a permutation, you could also use this to compute a lower bound on the best possible edit distance for a permutation, and run a branch and bound algorithm to find the best permutation.

Modify the Longest Common Sub sequence – Dhruv Singh Sep 01 '17 at 05:06 — Dhruv Singh, Sep 01 '17 at 05:06

How can I find the best fit subsequences of a large string?

3 Answers3