
Why is my code slow? If I remove all the code that fetches the meta tags, it becomes fast again. I don't know OOP, though.

It was pretty fast until I added the functionality to get meta tags from each URL. I don't know if it's the SQL that is making it slow, or the get_meta_tags function.

<?php
$servername = "localhost";
$username = "phpmyadmin";
$password = "supersu";
$starttime = time();
function cape($url,$depth=5) {
    if($depth>0) {
        $html = file_get_contents($url);
        $pattern = '~<a.*?href="(.*?)".*?>~'; 
        preg_match_all($pattern, $html, $matches); 
        foreach($matches[1] as $newurl) { 
            // do stuff 
            $regex = "((https?|ftp)\:\/\/)?"; 
            // SCHEME 
            $regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; 
            // User and Pass 
            $regex .= "([a-z0-9-.]*)\.([a-z]{2,3})"; 
            // Host or IP 
            $regex .= "(\:[0-9]{2,5})?"; 
            // Port 
            $regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; 
            // Path 
            $regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; 
            // GET Query 
            $regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; 
            // Anchor 
            // `i` flag for case-insensitive
            if(preg_match("/^$regex$/i", $newurl))  {
                if(substr($newurl, -1) !== "/") { 
                    $newurl = $newurl . "/"; 
                } 
                try { 
                    $tags = get_meta_tags($newurl); 
                    $desc = file_get_contents($newurl); 
                    $password = "supersu"; 
                    $username = "phpmyadmin"; 
                    $conn = new PDO("mysql:host=localhost;dbname=supersu", $username, $password); 
                    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION); 
                    if(isset($tags['description'])) { 
                        $conn->exec( 'INSERT INTO snapd (link,Description) SELECT * FROM (SELECT "'.$newurl.'","'.$tags['description'].'") AS tmp WHERE NOT EXISTS ( SELECT link FROM snapd WHERE link = "'.$newurl.'" ) LIMIT 1' );
                        echo $newurl; 
                    } else { 
                        $conn->exec( 'INSERT INTO snapd (link,Description) SELECT * FROM (SELECT "'.$newurl.'","No Info") AS tmp WHERE NOT EXISTS ( SELECT link FROM snapd WHERE link = "'.$newurl.'" ) LIMIT 1' ); 
                        echo $newurl; 
                    } 
                } catch(PDOException $e) { 
                    echo "Connection failed: " . $e->getMessage(); 
                } 
                cape($newurl,$depth-1); 
            } 
        } 
    } 
} 
cape("https://techdeploy.xyz"); 
?>
genx
    _“I don't know if SQL is making it slow, or the get_meta_tags function.”_ - probably rather the latter, because HTTP requests take up (comparably) huge amounts of time. – CBroe Oct 28 '20 at 09:26
    First step in debugging is writing readable code, like putting each instruction on its own line. – Cid Oct 28 '20 at 09:28
  • If I'm reading this right, this code takes a URL, then crawls *all* URLs it finds in that page, recursively, to a depth of five pages? I have no idea about the structure of the first page, but that feels like it could be an enormous amount of requests. – iainn Oct 28 '20 at 09:46
  • @iainn Yes, you got it correct. However, this runs pretty fast without the call to get_meta_tags and the database uploads; with that function and the database writes, the script becomes very slow: roughly 1 URL per 2 seconds, where before it was 15-20 URLs per 2 seconds. – genx Oct 28 '20 at 10:07
  • This is pretty close to attempting to [parse HTML with a regex](https://stackoverflow.com/a/1732454/1145801) – The General Oct 28 '20 at 21:11

1 Answer

  1. You're parsing HTML with a regex. You'd likely be better off using DOM functions (and for picking apart a URL, PHP's built-in parse_url is better than a hand-rolled regex).
  2. Both get_meta_tags and file_get_contents are making HTTP requests to the same URL. Make the request once, cache the result, and then do your processing on that result.
  3. You're creating a new DB connection for every page you process; you'd probably be better off keeping one connection open for the duration of the function.
  4. You're directly interpolating strings into your DB query. This is incredibly insecure (particularly when the strings come from someone else's server). It would be both more secure and faster to use prepared statements.
  5. I have no idea what the state of PHP's async IO support is (I suspect not good), but if there were a way to fire off some of the requests concurrently and then process the results, that would be significantly faster (though it would use more memory).
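To sketch point 1: links can be extracted with DOMDocument instead of a regex. The function name `extract_links` and the sample HTML string below are illustrative, not from the question:

```php
<?php
// Extract absolute <a href> values with DOMDocument instead of a regex.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings about malformed real-world HTML.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // filter_var validates absolute URLs, replacing the hand-rolled regex.
        if (filter_var($href, FILTER_VALIDATE_URL)) {
            $links[] = $href;
        }
    }
    return $links;
}

$html = '<p><a href="https://example.com/a">one</a> <a href="/relative">two</a></p>';
var_dump(extract_links($html)); // only the absolute URL survives validation
```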
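For point 2, one possible approach (a sketch, not the only one): fetch the page once, then read the meta description out of that same string, instead of letting get_meta_tags() make a second HTTP request. The helper name `meta_description` is made up for this example:

```php
<?php
// Pull the meta description out of already-fetched HTML, so the page
// is downloaded once instead of twice per URL.
function meta_description(string $html): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

// In the crawler this string would come from a single file_get_contents($newurl).
$html = '<html><head><meta name="description" content="A demo page"></head><body></body></html>';
echo meta_description($html); // A demo page
```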
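And a sketch of points 3 and 4 together: open one PDO connection up front and reuse a prepared statement for every page. SQLite in-memory is used here only so the example is self-contained; the question's code would keep its mysql: DSN, and MySQL's equivalent of SQLite's INSERT OR IGNORE is INSERT IGNORE. The table and column names (snapd, link, Description) match the question:

```php
<?php
// One connection for the whole crawl, created once outside the loop.
$conn = new PDO('sqlite::memory:');
$conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// A UNIQUE constraint on link replaces the "WHERE NOT EXISTS" subquery.
$conn->exec('CREATE TABLE snapd (link TEXT UNIQUE, Description TEXT)');

// Prepare once, execute many times: values are bound, never interpolated.
$stmt = $conn->prepare(
    'INSERT OR IGNORE INTO snapd (link, Description) VALUES (:link, :desc)'
);

foreach ([['https://example.com/', 'A demo'], ['https://example.com/', 'dupe']] as [$link, $desc]) {
    $stmt->execute([':link' => $link, ':desc' => $desc]);
}

echo $conn->query('SELECT COUNT(*) FROM snapd')->fetchColumn(); // 1
```

Binding the values also means a description containing a double quote can no longer break (or inject into) the query.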
The General