3

I have some HTML/CSS/JavaScript with painfully long class, id, variable and function names and other, combined strings that get used over and over. I could probably rename or restructure a few of them and cut the text in half.

So I'm looking for a simple algorithm that reports on the longest repeated strings in text. Ideally, it would reverse sort by length times instances, so as to highlight to strings that, if renamed globally, would yield the most savings.

This feels like something I could do painfully in 100 lines of code, for which there's some elegant, 10-line recursive regex. It also sounds like a homework problem, but I assure you it's not.

I work in PHP, but would enjoy seeing something in any language.

NOTE: I'm not looking for HTML/CSS/JavaScript minification per se. I like meaningful text, so I want to do it by hand, and weigh legibility against bloat.

alain.janinm
  • 19,951
  • 10
  • 65
  • 112
LibraryThingTim
  • 413
  • 1
  • 4
  • 10
  • The brute force method would be to start at position 0 and test if 0-1 were a repeated string. If yes, enter the pattern into an array with how many times it's repeated. Then try 0-2, 0-3, etc. Once the pattern isn't repeating, move the start position and do 1-2, etc. While you do this or after you throw away ones that add nothing (eg., if both hotdog and hot are repeated 10 times, you only keep hotdog). Blech. – LibraryThingTim Jul 25 '09 at 14:50
  • 2
    Example: The blue elephant enjoyed eating hot dogs in the sun. The penguin enjoyed lying in the sun with the blue elephant. blue elephant x 2 in the sun x 2 enjoyed x 2 – LibraryThingTim Jul 25 '09 at 14:52
  • I asked a similar question [here](http://stackoverflow.com/questions/117317/finding-common-blocks). Maybe there is something useful there. – Burkhard Jul 25 '09 at 14:57

2 Answers2

9

This will find all repeated strings:

(?=((.+)(?:.*?\2)+))

Use that with preg_match_all and select the longest one.

function len_cmp($match1,$match2) {
  return $match2[0] - $match1[0];
}

preg_match_all('/(?=((.+)(?:.*?\2)+))/s', $text, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
  $match[0] = substr_count($match[1], $match[2]) * strlen($match[2]);
}

usort($matches, "len_cmp");

foreach ($matches as $match) {
  echo "($matches[2]) $matches[1]\n";
}

This method could be quite slow though, as there could be a LOT of strings repeating. You could reduce it somewhat by specifying a minimum length, and a minimum number of repetitions in the pattern.

(?=((.{3,})(?:.*?\2){2,}))

This will limit the number of characters repeating to at least three, and the number of repetitions to three (first + 2).

Edit: Changed to allow characters between the repetitions.
Edit: Changed sorting order to reflect best match.

Markus Jarderot
  • 86,735
  • 21
  • 136
  • 138
0

Seems I'm a little late, but it also does the work:

preg_match_all('/(id|class)+="([a-zA-Z0-9-_ ]+)"/', $html, $matches);

$result = explode(" ", implode(" ", $matches[2]));
$parsed = array();
foreach($result as $string) {
    if(isset($parsed[$string])) {
        $parsed[$string]++;
    } else {
        $parsed[$string] = 1;
    }
}
arsort($parsed);

foreach($parsed as $k => $v) {
    echo $k . " -> Found " . $v . " times<br/>";
}

The ouput will be something like:

some_id -> Found 2 times
some_class -> Found 2 times
Nathan
  • 4,017
  • 2
  • 25
  • 20