I have the following code, but it's too slow:

<?php
 class Ngram {

 const SAMPLE_DIRECTORY = "samples/";
 const GENERATED_DIRECTORY = "languages/";
 const SOURCE_EXTENSION = ".txt";
 const GENERATED_EXTENSION = ".lng";
 const N_GRAM_MIN_LENGTH = 1;
 const N_GRAM_MAX_LENGTH = 6;

public function __construct() {
    mb_internal_encoding( 'UTF-8' );
    $this->generateNGram();
}

private function getFilePath() {
    // Collected sample files; always defined, even if nothing matches
    $filesPath = array();
    $excludes = array('.', '..');
    $path = rtrim(self::SAMPLE_DIRECTORY, DIRECTORY_SEPARATOR . '/');
    $files = array_diff(scandir($path), $excludes);
    foreach ($files as $file) {
        $fullPath = $path . DIRECTORY_SEPARATOR . $file;
        // Skip subdirectories (no recursive helper is defined in this class)
        if (is_dir($fullPath))
            continue;
        // Keep only files ending with the source extension
        if (!preg_match('/' . preg_quote(self::SOURCE_EXTENSION, '/') . '$/', $file))
            continue;
        $filesPath[] = $fullPath;
    }
    unset($file);
    return $filesPath;
}
protected function removeUniCharCategories($string){
    //Replace punctuation(' " # % & ! . : , ? ¿) become space " "
    //Example : 'You&me', become 'You Me'.
    $string = preg_replace( "/\p{Po}/u", " ", $string );
    //--------------------------------------------------
    $string = preg_replace( "/[^\p{Ll}|\p{Lm}|\p{Lo}|\p{Lt}|\p{Lu}|\p{Zs}]/u", "", $string );
    $string = trim($string);
    $string = mb_strtolower($string,'UTF-8');
    return $string;
}
private function generateNGram() {
    $files = $this->getFilePath();
    foreach($files as $file) {
        $file_content = file_get_contents($file, FILE_TEXT);
        $file_content = $this->removeUniCharCategories($file_content);
        $words = explode(" ", $file_content);
        $tokens = array();
        foreach ($words as $word) {
            $word = "_" . $word . "_";
            $length = mb_strlen($word, 'UTF-8');
            for ($i = self::N_GRAM_MIN_LENGTH, $min =  min(self::N_GRAM_MAX_LENGTH, $length); $i <= $min; $i++) {
                for ($j = 0, $li = $length - $i; $j <= $li; $j++) {
                    $token = mb_substr($word, $j, $i, 'UTF-8');
                    if (trim($token, "_")) {
                        $tokens[] = $token;
                    }   
                }
            }
        }
        unset($word);
        $tokens = array_count_values($tokens);
        arsort($tokens);
        $ngrams = array_slice(array_keys($tokens), 0);
        file_put_contents(self::GENERATED_DIRECTORY . str_replace(self::SOURCE_EXTENSION, self::GENERATED_EXTENSION, basename($file)), implode(PHP_EOL, $ngrams));
    }
    unset($file);
}
}
$ii = new Ngram();
?>

How can I make it faster? Thanks.

hakre
Ahmad

2 Answers

A quick Google search for 'how to profile php' led me to this Stack Overflow question: "Simplest way to profile a PHP script". It provides a really brief answer to your question.

That doesn't cover everything, but you may find useful information here: http://www.php.net/apd and http://www.xdebug.org/docs/profiler
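Before reaching for a full profiler, a rough timing wrapper can already show where the time goes. The following is only a minimal sketch: `timeIt` is a hypothetical helper (not part of the original class), and the workload is a stand-in that extracts 2-grams from one padded word the way generateNGram() does. It assumes PHP 7 or later with the mbstring extension.

```php
<?php
// Hypothetical helper: runs a callable and returns [result, elapsed seconds].
function timeIt(callable $fn): array
{
    $start = microtime(true);
    $result = $fn();
    return [$result, microtime(true) - $start];
}

// Stand-in workload: 2-grams of a padded word, similar to generateNGram().
list($grams, $seconds) = timeIt(function () {
    $word = '_hello_';
    $out = [];
    for ($j = 0, $last = mb_strlen($word, 'UTF-8') - 2; $j <= $last; $j++) {
        $out[] = mb_substr($word, $j, 2, 'UTF-8');
    }
    return $out;
});

printf("%d n-grams in %.6f s\n", count($grams), $seconds);
```

Wrapping the file read, the regex cleanup and the n-gram loops separately should indicate which of them is worth optimizing first.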

Community
PHP's foreach{} are way slower (up to 16 times) than for{}. Try replacing those in your generateNGram() function.

Plus you could copypaste your code from generateNGram() function into your constructor. It will prevent an useless call to a function.
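Whether for or foreach is faster can depend on the PHP version, so it is worth measuring on your own setup rather than trusting a fixed ratio. Here is a minimal benchmark sketch; the function names `sumWithFor` and `sumWithForeach` are hypothetical, and a single run like this is noisy, so repeat it several times before drawing conclusions. It assumes PHP 7 or later.

```php
<?php
// Hypothetical benchmark functions: both compute the same sum.
function sumWithFor(array $data): int
{
    $sum = 0;
    for ($i = 0, $n = count($data); $i < $n; $i++) {
        $sum += $data[$i];
    }
    return $sum;
}

function sumWithForeach(array $data): int
{
    $sum = 0;
    foreach ($data as $value) {
        $sum += $value;
    }
    return $sum;
}

$data = range(1, 50000);

$start = microtime(true);
$forSum = sumWithFor($data);
$forTime = microtime(true) - $start;

$start = microtime(true);
$foreachSum = sumWithForeach($data);
$foreachTime = microtime(true) - $start;

// Print both timings; the numbers will differ between machines and versions.
printf("for:     %.6f s\nforeach: %.6f s\n", $forTime, $foreachTime);
```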

monsieur_h
  • "PHP's foreach{} are way slower (up to 16 times) than for{}" [citation needed] ;) "Plus you could copypaste your code from generateNGram() function into your constructor. It will prevent an useless call to a function." Negligible, but it's a very bad habit to have too much stuff in the constructor – KingCrunch Jul 01 '11 at 13:42
  • I agree about the constructor thing, but as far as I know, foreach isn't a good choice unless you want to iterate over multidimensional arrays while keeping an ID: foreach($this as $id=>$that){} – monsieur_h Jul 01 '11 at 13:47
  • OK, because you didn't, I've searched myself and found something that says the very opposite of what you claim: http://www.phpbench.com/ (you need to scroll down a little bit). – KingCrunch Jul 01 '11 at 13:51
  • I've searched but did not find any example in English. The magazine that ran the benchmark was in French. Anyway, your site states that a foreach($i as $z) is faster than a foreach($i as $id=>$z), which I don't deny. I said that for($i;$i<$max;$i++) is faster than any foreach(). As a consequence, any foreach can be avoided as long as you don't need to keep your ID in a variable during the loop. [Check out here](http://nathanhoad.net/php-for-vs-foreach). Great site by the way. – monsieur_h Jul 01 '11 at 13:58
  • "I said that for($i;$i<$max;$i++) is faster than any foreach()." Have a closer look and you'll see that "my site" also states that this is wrong. I don't say you are wrong, but you don't give any evidence, and if your source was only published in one local magazine, I don't know if it's really reliable. (Your "Check out here" link doesn't fit into this discussion; it's about something quite different.) – KingCrunch Jul 01 '11 at 14:01
  • I've found some links saying it ["my" way](http://blog.yoda-bzh.net/index.php?post/2010/02/04/PHP-bench%3A-count-for-vs-foreach) and some links saying it "your" way. I'm beginning to suspect a PHP version issue is involved. The only way to know is to test, then. – monsieur_h Jul 01 '11 at 14:07
  • Thanks, all of you. I will change all the loops to for() {} – Ahmad Jul 01 '11 at 14:18