8

I'm trying to figure out a way to detect groups of files. For instance:

If a given directory has the following files:

  • Birthday001.jpg
  • Birthday002.jpg
  • Birthday003.jpg
  • Picknic1.jpg
  • Picknic2.jpg
  • Afternoon.jpg

I would like to condense the listing to something like

  • Birthday ( 3 pictures )
  • Picknic ( 2 pictures )
  • Afternoon ( 1 picture )

How should I go about detecting the groups?

SU3
Ambirex

3 Answers

6

Here's one way you can solve this, which is more efficient than a brute force method.

  • load all the names into an associative array whose keys are the names and whose values are the names with the digits stripped (preg_replace('/\d+/', '', $key)).

You will have something like $arr1 = [Birthday001 => Birthday, Birthday002 => Birthday ...]

  • now make another associative array whose keys are the values from the first array and whose values are counts. Start the count at 1 the first time you see a key and increment it each time you see that key again.
  • in the end you will have a 2nd array that contains the names and counts, just like you wanted. Something like $arr2 = [Birthday => 3, ...]
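
The steps above could be sketched roughly as follows. This is only an illustration, not the answerer's own code: the $filenames input is assumed from the question, and pathinfo() is used here to drop the extension before the digits are stripped, which the steps above do not spell out.

```php
<?php
// Assumed input: the file names listed in the question.
$filenames = array('Birthday001.jpg', 'Birthday002.jpg', 'Birthday003.jpg',
                   'Picknic1.jpg', 'Picknic2.jpg', 'Afternoon.jpg');

// Step 1: map each base name to the same name with its digits stripped.
$arr1 = array();
foreach ($filenames as $filename) {
    $base = pathinfo($filename, PATHINFO_FILENAME); // assumption: drop ".jpg" first
    $arr1[$base] = preg_replace('/\d+/', '', $base);
}

// Step 2: count how often each stripped name occurs.
$arr2 = array();
foreach ($arr1 as $group) {
    if (!isset($arr2[$group])) {
        $arr2[$group] = 0;
    }
    $arr2[$group]++;
}

print_r($arr2);
```

For the six files in the question, $arr2 should end up holding Birthday => 3, Picknic => 2 and Afternoon => 1. The second loop could probably be replaced by a single call to array_count_values($arr1).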
Artem Russakovskii
  • This would work if you assume that all semantic tokens are equal once the digits are stripped. It wouldn't address items like "My Birthday001.jpg" and "MyBirthday002.jpg", but it is a good starting point. – Kitson Jul 26 '09 at 17:28
  • I absolutely agree. However, the question was not posed that way and whoever edited it to include My Birthday and group it with Birthday001, Birthday002 has changed the question considerably. The OP may actually want to group that into 2 different groups. – Artem Russakovskii Jul 26 '09 at 17:47
  • Yes, this is pretty much exactly what I am looking for. My main concern was matching the prefix string. This is a great starting point. Thank you. – Ambirex Jul 26 '09 at 18:24
  • I rolled back that edit adding the "My Birthday" entry--that was out of line. – Alan Moore Jul 27 '09 at 04:14
  • To deal with things like "My Birthday", you could try using the `levenshtein` function to calculate distance between tokens, and automatically group tokens with a distance less than a pre-set threshold. – Tobias Cohen Jul 27 '09 at 04:18
  • I had started going down that path, but the question became how do I automagically determine that threshold. – Ambirex Jul 30 '09 at 05:12
2

Simply build a histogram whose keys are modified by a regex:

<?php

# input
$filenames = array("Birthday001.jpg", "Birthday002.jpg", "Birthday003.jpg", "Picknic1.jpg", "Picknic2.jpg", "Afternoon.jpg");

# create histogram
$histogram = array();
foreach ($filenames as $filename) {
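    # strip optional trailing digits plus the extension, so that
    # "Afternoon.jpg" (no digits) is reduced to "Afternoon" as well
    $name = preg_replace('/\d*\.[^.]*$/', '', $filename);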
    if (isset($histogram[$name])) {
        $histogram[$name]++;
    } else {
        $histogram[$name] = 1;
    }
}

# output
foreach ($histogram as $name => $count) {
    if ($count == 1) {
        echo "$name ($count picture)\n";
    } else {
        echo "$name ($count pictures)\n";
    }
}

?>
vog
0

Build an array of filler words like "my" (developing this array will be very important; "my" is the only one in the example you gave) and strip these out of all the file names. Then strip out all numbers and punctuation; the extensions should already be gone at this point. Once this is done, put all of the unique results into an array. You can then use this as a fairly reliable source of keywords to search for any stragglers that the other processing didn't catch.
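
A minimal sketch of that idea, under several assumptions: the $stopWords list, the lowercasing, and the "first matching keyword wins" rule are all illustrative choices, and the input includes the "My Birthday.jpg" entry from the earlier revision of the question.

```php
<?php
// Assumed input, including a name with a filler word.
$filenames = array('Birthday001.jpg', 'Birthday002.jpg', 'My Birthday.jpg', 'Picknic1.jpg');

// Hypothetical stop-word list; a real one would have to be built up by hand.
$stopWords = array('my');

// Step 1: reduce every file name to a keyword.
$keywords = array();
foreach ($filenames as $filename) {
    $name  = pathinfo($filename, PATHINFO_FILENAME);      // drop the extension
    $name  = preg_replace('/[^A-Za-z ]+/', '', $name);     // drop digits and punctuation
    $words = preg_split('/\s+/', strtolower($name), -1, PREG_SPLIT_NO_EMPTY);
    $words = array_diff($words, $stopWords);               // drop the filler words
    $keyword = implode(' ', $words);
    if ($keyword !== '') {
        $keywords[] = $keyword;
    }
}
$keywords = array_values(array_unique($keywords));

// Step 2: assign each file to the first keyword that its name contains.
$groups = array();
foreach ($filenames as $filename) {
    foreach ($keywords as $keyword) {
        if (stripos($filename, $keyword) !== false) {
            $groups[$keyword][] = $filename;
            break;
        }
    }
}

print_r($groups);
```

With this input, "My Birthday.jpg" should land in the same group as the two numbered Birthday files. Whether that grouping is actually wanted depends on the data, so the stop-word list needs care.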

Alex S
  • Note: this answer is based on a revised version of the question which has since been rolled back. That version included a file named "My Birthday.jpg" which was supposed to be grouped with the other "Birthday" files. – Alan Moore Jul 27 '09 at 04:23