1

I need to find longest common substring with greatest (length of substring * count).

For example, when I have string:

hi, hello world ... hello world ... hi, hello world

The answer is hello world, because (11 * 3) > (15 * 2).

I have found a relevant discussion in this question, but it is not practical to use in my case, because of its high memory usage.

Is there any better way how to do that?

Community
  • 1
  • 1
Jakub Tětek
  • 161
  • 1
  • 1
  • 7
  • are these suitable? http://stackoverflow.com/questions/19158025/longest-common-substring-in-two-sequences-of-strings – Stasik Dec 30 '13 at 10:35
  • In the [Wikipedia article](https://en.wikipedia.org/wiki/Longest_common_substring_problem) there is a pseudocode solution using dynamic programming. – BioGeek Dec 30 '13 at 10:39
  • A little. He is looking for longest substring. I'm looking for substring that takes most of space in the string (gratest length of substring * count) – Jakub Tětek Dec 30 '13 at 10:40
  • "it is not practical to use this because it's high memory use". I disagree. you will insert each string in the trie, which will have a number of characters smaller than the sum of characters of each string. And you can divide the problems if there are too many strings to fit in memory – UmNyobe Dec 30 '13 at 10:40
  • google "generalized suffix tree" – UmNyobe Dec 30 '13 at 10:44
  • This question appears to be off-topic because the assumption than an existing approach on SO has "high memory usage" is neither explained nor proven – UmNyobe Dec 30 '13 at 10:46
  • why is the answer not " hello world"? – Paul Hankin Dec 30 '13 at 11:51

1 Answers1

1

Here is a solution in perl. Probably it is not memory efficient but it works with the test string shown

use warnings;
use strict;

my $s="hi, hello world ... hello world ... hi, hello world";
my %h=();

#find the repeated strings, all of them
for (0..length($s) ) { 
    my $x=substr($s,$_); 
    for my  $m ($x=~/(.*).*\1/) { $h{$m}++} ; }

#find the count of each strings repeats
for my $f (keys %h) { $h{$f} = () = $s=~/\Q$f/g; }

#sort the length*count to find the best
my @ord=sort { length($b)*$h{$b} <=> length($a)*$h{$a} } keys %h;

#this one is the best
print $ord[0];
Vorsprung
  • 32,923
  • 5
  • 39
  • 63