9

I want to align a pair of long texts with ~20M chars each.

I've used in the past Smith-Waterman algorithm but (from my limited understanding) it requires creating a 2-dimensional array with the size of the texts (20M by 20M array) - which is not practical.

So I'm looking for an algorithm to align a pair of long texts that will keep a practical memory size and execution time.

UPDATE

I've also tried Myers and Miller using this implementation: https://www.codeproject.com/Articles/42279/Investigating-Myers-diff-algorithm-Part-of But I still got out of memory exception on "not so large" texts (1MB).

Uddhav P. Gautam
  • 7,362
  • 3
  • 47
  • 64
Omri
  • 887
  • 8
  • 24
  • 2
    https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm recognises the limitations with large data sets and notes a couple of alternatives. Have you tried those? – AJD Jan 27 '18 at 20:49
  • 1
    Perhaps try MAFFT --text alignment. It should default to a Smith Waterman alignment (--localpair) and uses dynamic programming to greatly reduce computation time. The author is also in the process of developing an extended alphabet. https://mafft.cbrc.jp/alignment/software/manual/manual.html – Ghoti Feb 16 '18 at 19:22
  • @AJD, yes. I've tried Myers and Miller using this implementation: https://www.codeproject.com/Articles/42279/Investigating-Myers-diff-algorithm-Part-of and i still got out of memory exception... do you know other good implementations? – Omri Feb 24 '18 at 16:30
  • @Ghoti, that is interesting but it seems to focus on amino acid charset while i need more generic (utf8) charset – Omri Feb 24 '18 at 17:11
  • @Omri, MAFFT will autodetect nucleotide or amino acid sequences by default. Use --text mode for character alignment (UTF8 will need to be converted to an acceptable format). I'm not sure I intended to link the general user manual. Here's a better example of what the program is capable of (specifically look for "Extended alphabet"): https://mafft.cbrc.jp/alignment/software/textcomparison.html – Ghoti Feb 26 '18 at 17:09
  • 3
    Have you looked at [Hirschberg's Algorithm](https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm)? It only requires linear memory. – SaiBot Mar 01 '18 at 08:30
  • Have you seen this https://github.com/jonschlinkert/align-text – norcal johnny Mar 03 '18 at 01:46
  • 1
    I have the feeling that what you try to accomplish is extremely hard if done without simplifications, probably not computable in the general case in reasonable time for an input size of 20M. Have you considered using text diff algorithms to find approximations of your desired result and would an approximation suit your use case? You tell us very little about it … – Alfe Mar 26 '19 at 11:20

0 Answers0