I have a function that reads a text file and cross-matches it against a directory listing, so that the descriptions (from the text file) can be paired with the files in the directory. I used the levenshtein function to add some fuzzy matching so the names don't need to be 100% identical, but I'm running into a snag. As it is set up now I hit a memory wall: when I uncomment the lines below, it searches the entire txt file and compares every single line against the directory file name. With over 700 files, each being checked 700 times, I quickly run out of memory. I need some way to jump out of the while (!feof($file_handle)) loop when it finds a match, and then some way to set the starting point for the next pass to the line position where it stopped, so it isn't looping 0-700 every single time.

function GenerateList($titleB, $descB, $thumbB, $dirB, $patternB){
$outputB = "<CATEGORY name=\"$titleB\" desc=\"$descB\" thumb=\"$thumbB\">";
$open_error = 0;

if (is_dir($dirB)){
$myDirectory = opendir($dirB);
// get each entry
while($entryName = readdir($myDirectory)) {
    $dirArray[] = $entryName;
}

// close directory
closedir($myDirectory);

//  count elements in array
$indexCount = count($dirArray);

// sort em
sort($dirArray);
// loop through the array of files and print them all
if (!($text = file_get_contents("Scripts/descriptions.txt"))){$open_error = 1;}
$results = array();
for($index=0; $index < $indexCount; $index++) {
    $ext = explode(".", $dirArray[$index]);
    $parsed_title = preg_replace ($patternB, "", $ext[0]);
    if ((substr("$dirArray[$index]", 0, 1) != ".")&&($ext[1] == "flv")){ // don't list hidden files

//if ($open_error == 0){
//  $file_handle = fopen("Scripts/descriptions.txt", "rb");

//while (!feof($file_handle) ) {
//$line_of_text = fgets($file_handle);
//$parts = explode('|', $line_of_text);
/*
echo "<PRE>";
echo strtolower($parts[0]);
echo "</br>";
echo strtolower($parsed_title);
echo "</br>";
echo "</PRE>";
*/
//if ((wordMatch(strtolower($parts[0]), strtolower($parsed_title), 2)) > 0){
        $outputB .= "<ITEM>";
        $outputB .= "<file_path>/Sources/Power Rangers/$dirB".$dirArray[$index]."</file_path>";
        $outputB .= "<file_width>500</file_width>";
        $outputB .= "<file_height>375</file_height>";
        $outputB .= "<file_title>".$parsed_title."</file_title>";
//      $outputB .= "<file_desc>".$parts[1]."</file_desc>";
        $outputB .= "<file_desc>test</file_desc>";
//      $outputB .= "<file_image>".$match_result[2]."</file_image>";
        $outputB .= "<file_image>$thumbB</file_image>";
//      $outputB .= "<featured_image>".$match_result[3]."</featured_image>";
        $outputB .= "<featured_image>$thumbB</featured_image>";
//      $outputB .= "<featured_or_not>".$parts[4]."</featured_or_not>";
        $outputB .= "<featured_or_not>true</featured_or_not>";
        $outputB .= "</ITEM>";
//};//if ((wordMatch($parts[0], strtolower($word), 2) > 0)
//};//while
//fclose($file_handle);

//};//if ($open_error == 0)
    };//if ((substr("$dirArray[$index]", 0, 1) != ".")&&($ext[1] == "flv"))
};//for($index=0; $index < $indexCount; $index++) 
};//if (is_dir($dirB))
$outputB .= "</CATEGORY>";
return $outputB;
};//function

    function wordMatch($words, $input, $sensitivity){ 
        $shortest = -1; 
        foreach ($words as $word) { 
            $lev = levenshtein($input, $word); 
            if ($lev == 0) { 
                $closest = $word; 
                $shortest = 0; 
                break; 
            } //if
            if ($lev <= $shortest || $shortest < 0) { 
                $closest  = $word; 
                $shortest = $lev; 
            } //if
        } //foreach
        if($shortest <= $sensitivity){ 
            return $closest; 
        } else { 
            return 0; 
        } //if/else
    } // function, http://php.net/manual/en/function.levenshtein.php
NekoLLX
  • How do you define "80%"? A regex either matches or it doesn't. – ghoti Jul 29 '12 at 03:01
  • That's the tricky part: if $parsed is, say, "Peace Love and Woe" and the match is "Peace Love & Woe" or "Peace Love andWoe" or "Peace Love and Woe.avi", they should all be valid – NekoLLX Jul 29 '12 at 03:16
  • So ... your "80%" rule isn't so much a defined rule as it is a thing you would like help defining? Have you considered [fuzzy logic](http://en.wikipedia.org/wiki/Fuzzy_logic)? You won't be able to implement it in a regular expression, but it might get you closer to your goal. ALSO, including some sample data (along with your idea of how much of a match it represents) would make it easier to write something that actually matches your requirements. – ghoti Jul 29 '12 at 03:35
  • Samples are easy enough: the parsed data is the left menu here http://maskedriders.info/Sources/Power%20Rangers/ while the file is here http://maskedriders.info/Sources/Power%20Rangers/Scripts/descriptions.txt The idea is that I want to match a line in the txt with the parsed title so I can load the other fields after the | in the txt into my XML data – NekoLLX Jul 29 '12 at 03:48
  • As for fuzzy logic, after explaining it I was thinking that if I just strip out white spaces and (), convert any & to "and", and cast it all to lower case, it should fulfill the 80% rule. The problem is I would have to do the same to the text file, but only to the part before the | – NekoLLX Jul 29 '12 at 03:56
  • Sounds like you have a tentative solution. Why don't you try it out, and if there are still challenges, see what the intelligentsia of StackOverflow can offer? – ghoti Jul 30 '12 at 03:34
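
The normalization idea from the comments above could be sketched roughly as follows. The function name `normalizeTitle` is hypothetical (it does not appear in the question), and the exact set of characters to strip is an assumption based on the comment; file extensions such as ".avi" are not handled here.

```php
<?php
// Hypothetical helper: reduce a title to a canonical form before comparing.
function normalizeTitle(string $title): string
{
    $title = strtolower($title);              // compare case-insensitively
    $title = str_replace('&', 'and', $title); // treat "&" and "and" alike
    // Drop all whitespace and parentheses.
    return (string) preg_replace('/[\s()]+/', '', $title);
}

// For a line of descriptions.txt, only the part before the first "|" is normalized.
function normalizeDescriptionKey(string $line): string
{
    $parts = explode('|', $line, 2);
    return normalizeTitle($parts[0]);
}
```

With this, "Peace Love & Woe" and "Peace Love andWoe" normalize to the same string, so an exact comparison (or a small edit-distance threshold) can be applied afterwards.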

1 Answer

Instead of a regex, you can compute the edit distance between the two items. Your 80% heuristic would then be equivalent to saying that (length-edit_distance)/length >= .8 where length is the length of the string you're trying to match.

So if the string was 20 chars long and the edit distance from your target was 2, you would calculate that (20-2) / 20 == .9 In other words, that item was a 90% match to your target. That's higher than .8, so you accept it as a match.

Note that 'edit distance' is also known as Levenshtein distance, so you just do something like:

$len = (float) strlen($target);  // Avoids integer division.
$match = ($len-levenshtein($input, $target))/$len;

if ($match >= 0.8) {
  // The $input matches our $target
}
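
As a minimal sketch, the check above can be wrapped in a small helper. The name `similarityRatio` is hypothetical, and dividing by the longer of the two lengths (rather than the target's length) is an assumption made here so the ratio always stays between 0 and 1; empty strings are guarded to avoid a division by zero.

```php
<?php
// Hypothetical helper: similarity in [0, 1] derived from the Levenshtein distance.
function similarityRatio(string $input, string $target): float
{
    // Assumption: normalize by the longer string so the result is never negative.
    $len = max(strlen($input), strlen($target));
    if ($len === 0) {
        return 1.0; // two empty strings are identical
    }
    return ($len - levenshtein($input, $target)) / $len;
}

// Example: accept anything that is at least an 80% match.
$isMatch = similarityRatio('peace love and woe', 'peace love & woe') >= 0.8;
```

Here the two titles differ by three edits over 18 characters, so the ratio is 15/18 and the pair is accepted under the 0.8 threshold.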
Qsario
  • Nice idea. If you can include (or point to) some example code that calculates edit distance between strings in PHP, that would definitely get you StackOverflow points. :) – ghoti Jul 29 '12 at 13:51
  • There is also the issue of searching a txt file for a link matching the heuristics. I'd really like to avoid loading the entire TXT into a variable or array only to discard most of it, since I have to loop through it about 700 times to populate the XML file for each entry – NekoLLX Jul 29 '12 at 23:43
  • I think edit distance is a PHP function already? http://php.net/manual/en/function.levenshtein.php As for the rest, don't all your links start with http:// or https://? You could just pull out everything that looks like a link, then do your fuzzy matching stuff. – Qsario Jul 30 '12 at 17:33
  • OK, I think I'm getting somewhere but I've run into a snag; check the modified OP to see the problem – NekoLLX Jul 31 '12 at 17:55
  • What's the snag? The question looks the same to me? – Qsario Jul 31 '12 at 17:58
  • Wow, you're fast. I was still updating the OP when you were reading; check now – NekoLLX Jul 31 '12 at 18:03
  • This would have been better as a new question, given that this one is already "answered" so people may not look at it much. Anyhow, the problem is with how you're calling this. It looks like you're doing GenerateList 700 times. You need to invert things so that you go through each file *once* and keep attempting to match an individual line against whatever until you either find a match or find that nothing will match it. Then it will be simple to skip to the next line whenever you have a match. – Qsario Jul 31 '12 at 18:36
  • The problem is that WordMatch requires 2 parameters: the first is the txt file and the second is the directory, so I need access to both to do a match check – NekoLLX Jul 31 '12 at 18:45
  • The `WordMatch` code I see has only a list of words, a target word, and a sensitivity? Where does it take a directory? – Qsario Jul 31 '12 at 18:59
  • The first parameter is fed from the txt file (though I should probably change it from accepting an array to a single string input), the second is the match target, currently $parsed_title, which is the current file indexed from the directory, and the third is the sensitivity. So wordMatch is comparing a line of the txt file against an array position (directory) and seeing if they match with a sensitivity of 2 – NekoLLX Jul 31 '12 at 19:05
  • If you're going for a bunch of 1-to-1 matches, slurp that whole text file into an array (kind of like you do with the directories), then remove the matched items from both arrays after you have generated your output. ( http://stackoverflow.com/questions/369602/how-to-delete-an-element-from-an-array-in-php ) – Qsario Jul 31 '12 at 21:16
  • gonna create a new discussion to go on from here – NekoLLX Jul 31 '12 at 22:23
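
The suggestion in the comments above (read the text file once, stop at the first match, and remove matched lines so they are not compared again) might look roughly like this. It is only a sketch: the function name `matchDescriptions` is hypothetical, the default sensitivity of 2 is taken from the question, and it assumes each line has the form `title|description|...`.

```php
<?php
// Hypothetical helper: pair each title with the first description line whose
// text before "|" is within $sensitivity edits, then drop that line.
function matchDescriptions(array $titles, array $lines, int $sensitivity = 2): array
{
    $matches = [];
    foreach ($titles as $title) {
        foreach ($lines as $i => $line) {
            $parts = explode('|', $line);
            if (levenshtein(strtolower($parts[0]), strtolower($title)) <= $sensitivity) {
                $matches[$title] = $parts;
                unset($lines[$i]); // a matched line is never compared again
                break;             // stop scanning once this title is matched
            }
        }
    }
    return $matches;
}

// Assumed usage: read descriptions.txt once, outside the per-file loop, e.g.
// $lines = file('Scripts/descriptions.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
// $matches = matchDescriptions($parsedTitles, $lines);
```

Because the file is read into memory a single time and the list shrinks with every match, the work per title drops as the run proceeds, instead of reopening and rescanning all 700 lines for each of the 700 files.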