Find longest repeating strings?

Question

I have some HTML/CSS/JavaScript with painfully long class, id, variable and function names and other, combined strings that get used over and over. I could probably rename or restructure a few of them and cut the text in half.

So I'm looking for a simple algorithm that reports on the longest repeated strings in text. Ideally, it would reverse sort by length times instances, so as to highlight to strings that, if renamed globally, would yield the most savings.

This feels like something I could do painfully in 100 lines of code, for which there's some elegant, 10-line recursive regex. It also sounds like a homework problem, but I assure you it's not.

I work in PHP, but would enjoy seeing something in any language.

NOTE: I'm not looking for HTML/CSS/JavaScript minification per se. I like meaningful text, so I want to do it by hand, and weigh legibility against bloat.

The brute force method would be to start at position 0 and test if 0-1 were a repeated string. If yes, enter the pattern into an array with how many times it's repeated. Then try 0-2, 0-3, etc. Once the pattern isn't repeating, move the start position and do 1-2, etc. While you do this or after you throw away ones that add nothing (eg., if both hotdog and hot are repeated 10 times, you only keep hotdog). Blech. — LibraryThingTim, Jul 25 '09 at 14:50
Example: The blue elephant enjoyed eating hot dogs in the sun. The penguin enjoyed lying in the sun with the blue elephant. blue elephant x 2 in the sun x 2 enjoyed x 2 — LibraryThingTim, Jul 25 '09 at 14:52
I asked a similar question [here](http://stackoverflow.com/questions/117317/finding-common-blocks). Maybe there is something useful there. — Burkhard, Jul 25 '09 at 14:57

Markus Jarderot · Answer 1 · 2009-07-25T15:20:15.253

9

This will find all repeated strings:

(?=((.+)(?:.*?\2)+))

Use that with preg_match_all and select the longest one.

function len_cmp($match1,$match2) {
  return $match2[0] - $match1[0];
}

preg_match_all('/(?=((.+)(?:.*?\2)+))/s', $text, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
  $match[0] = substr_count($match[1], $match[2]) * strlen($match[2]);
}

usort($matches, "len_cmp");

foreach ($matches as $match) {
  echo "($matches[2]) $matches[1]\n";
}

This method could be quite slow though, as there could be a LOT of strings repeating. You could reduce it somewhat by specifying a minimum length, and a minimum number of repetitions in the pattern.

(?=((.{3,})(?:.*?\2){2,}))

This will limit the number of characters repeating to at least three, and the number of repetitions to three (first + 2).

Edit: Changed to allow characters between the repetitions.
Edit: Changed sorting order to reflect best match.

edited Jul 25 '09 at 15:20

answered Jul 25 '09 at 14:58

Markus Jarderot

86,735
21
136
138

That doesn't matter. It will still try all lengths. – Markus Jarderot Jul 25 '09 at 15:03
1

But you can limit that with `'{1,'.floor(strlen($text)/2).'}'`. – Gumbo Jul 25 '09 at 15:19
I'm still trying alternatives, mostly for speed, but this is very clever. Clearly standard compression algorithms are doing something less complete! – LibraryThingTim Jul 25 '09 at 16:14

score 0 · Answer 2 · answered Jul 25 '09 at 15:15

Seems I'm a little late, but it also does the work:

preg_match_all('/(id|class)+="([a-zA-Z0-9-_ ]+)"/', $html, $matches);

$result = explode(" ", implode(" ", $matches[2]));
$parsed = array();
foreach($result as $string) {
    if(isset($parsed[$string])) {
        $parsed[$string]++;
    } else {
        $parsed[$string] = 1;
    }
}
arsort($parsed);

foreach($parsed as $k => $v) {
    echo $k . " -> Found " . $v . " times<br/>";
}

The ouput will be something like:

some_id -> Found 2 times
some_class -> Found 2 times

Find longest repeating strings?

2 Answers2

Linked