17

I have written some code in PHP that fetches the HTML content of pages on .edu domains. A brief introduction is given here: Errors regarding Web Crawler in PHP

The crawler works fine when the number of links to crawl is small (around 40 URLs), but beyond that I get a "MySQL server has gone away" error.

I am storing the HTML content as LONGTEXT in MySQL tables, and I don't understand why the error appears after 40-50 insertions.

Any help in this regard is highly appreciated.

Please note that I have already increased wait_timeout and max_allowed_packet to accommodate my queries and the PHP code, and now I don't know what else to do. Please help me in this regard.


5 Answers

16

You might be inclined to handle this problem by "pinging" the mysql server before a query. This is a bad idea. For more on why, check this SO post: Should I ping mysql server before each query?

The best way to handle the issue is by wrapping queries inside try/catch blocks and catching any database exceptions so that you can handle them appropriately. This is especially important in long running and/or daemon type scripts. So, here's a very basic example using a "connection manager" to control access to DB connections:

class DbPool {

    private $connections = array();

    function addConnection($id, $dsn) {
        $this->connections[$id] = array(
            'dsn' => $dsn,
            'conn' => null
        );
    }

    function getConnection($id) {
        if (!isset($this->connections[$id])) {
            throw new Exception('Invalid DB connection requested');
        } elseif (isset($this->connections[$id]['conn'])) {
            return $this->connections[$id]['conn'];
        } else {
            try {
                // for mysql you need to supply user/pass as well
                $conn = new PDO($this->connections[$id]['dsn']);

                // Tell PDO to throw an exception on error
                // (like "MySQL server has gone away")
                $conn->setAttribute(
                    PDO::ATTR_ERRMODE,
                    PDO::ERRMODE_EXCEPTION
                );
                $this->connections[$id]['conn'] = $conn;

                return $conn;
            } catch (PDOException $e) {
                return false;
            }
        }
    }

    function close($id) {
        if (!isset($this->connections[$id])) {
            throw new Exception('Invalid DB connection requested');
        }
        $this->connections[$id]['conn'] = null;
    }


}


class Crawler {

    private $dbPool;

    function __construct(DbPool $dbPool) {
        $this->dbPool = $dbPool;
    }

    function crawl() {
        // crawl and store data in $crawledData variable
        $this->saveData($crawledData);
    }

    function saveData($crawledData) {
        if (!$conn = $this->dbPool->getConnection('write_conn')) {
            // doh! couldn't retrieve DB connection ... handle it
        } else {
            try {
                // perform query on the $conn database connection
            } catch (Exception $e) {
                $msg = $e->getMessage();
                if (strstr($msg, 'MySQL server has gone away')) {
                    $this->dbPool->close('write_conn');
                    $this->saveData($crawledData);
                } else {
                    // some other error occurred
                }
            }
        }
    }
}
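For completeness, a hypothetical wiring of the two classes above (the connection id matches the one `saveData()` requests; the DSN is a placeholder, and for MySQL you would also pass user/pass to the pool):

```php
<?php
// Sketch only: register a named connection, then let the crawler use it.
$pool = new DbPool();
$pool->addConnection('write_conn', 'mysql:host=localhost;dbname=crawler1');

$crawler = new Crawler($pool);
$crawler->crawl(); // saveData() lazily opens 'write_conn' on first use
```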
  • No, this is an Exception class you would specify yourself and throw from inside the `saveData()` function. I've updated the `saveData` function and added a custom DbException class in my answer to reflect this ... –  Dec 31 '11 at 18:58
4

I have another answer that deals with what I think is a similar problem, and it calls for a similar solution. Basically, you can use the mysql_ping() function to test the connection before your insert. Before MySQL 5.0.14, mysql_ping() would automatically reconnect to the server, but now you have to build your own reconnect logic. Something similar to this should work for you:

function check_dbconn($connection) {
    if (!mysql_ping($connection)) {
        mysql_close($connection);
        $connection = mysql_connect('server', 'username', 'password');
        mysql_select_db('db',$connection);
    } 
    return $connection;
}

foreach ($array as $value) {
    $dbconn = check_dbconn($dbconn);
    $sql = "insert into collected values('" . mysql_real_escape_string($value, $dbconn) . "')";
    $res = mysql_query($sql, $dbconn);
    //then some extra code.
}
  • Pinging is not a good strategy in this case ... for more on why, check this SO post: [Should I ping mysql server before each query?](http://stackoverflow.com/questions/3103969/should-i-ping-mysql-server-before-each-query) –  Dec 31 '11 at 18:15
1

I was facing the "MySQL server has gone away" error while using MySQL Connector 5.x; replacing the DLL with the latest version solved the problem.

0

Are you opening a single DB connection and reusing it? Is it possible that it's a simple timeout? You might be better served by opening a new DB connection for each of your read/write operations (i.e. contact .edu, get text, open DB, write text, close DB, repeat).

Also, how are you using the handle? Is it possible that it has hit an error and has 'gone away' for that reason?
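A rough sketch of that open-write-close pattern, using the same mysql_* API as elsewhere in this thread (the credentials and table come from the asker's own snippet; the helper name is made up):

```php
<?php
// Hypothetical per-operation connection: open, write, close for each page,
// so a long pause while crawling can never time out an idle connection.
function store_page($sourceLink, $source) {
    $conn = mysql_connect('localhost', 'root', '');
    mysql_select_db('crawler1', $conn);

    $sql = "INSERT INTO html (sourceLink, sourceCode) VALUES ('"
         . mysql_real_escape_string($sourceLink, $conn) . "', '"
         . mysql_real_escape_string($source, $conn) . "')";
    $ok = mysql_query($sql, $conn);

    mysql_close($conn);
    return $ok;
}
```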

  • Should I open a new connection for each query and then close it after executing that query? And repeat the procedure for all the queries? – Rafay Dec 31 '11 at 18:16
  • For the record, opening a new connection for each query is horribly inefficient ... –  Dec 31 '11 at 18:25
  • It's worth adding that if a thread is killed on the db (with `KILL [thread id]`) then you'll get the "server has gone away" error too. – James C Dec 31 '11 at 19:19
  • @rdlowrey - when you're talking about a slow process like reading web sites the inefficiency of (re)loading a db connection is minimal by comparison. Im not implying that he get a new connection for every query - but perhaps for each thread. (Assuming 1 thread = 1 site and 1 file written into the DB) – ethrbunny Jan 01 '12 at 00:22
  • I think it would be sufficient for this answer to say, "For whatever reason, MySQL ran out of connections." I upvoted you anyway :-) – PJ Brunet Jun 02 '15 at 02:01
0

Well, this is what I am doing now based on rdlowrey's suggestion, and I guess this is also right.

public function url_db_html($sourceLink = NULL, $source) {
    $source = mysql_real_escape_string($source);

    $query = "INSERT INTO html (id, sourceLink, sourceCode)
            VALUES (NULL,('$sourceLink') , ('$source'))";

    try {
        if(mysql_query($query, $this->connection)==FALSE) {
            $msg = mysql_errno($this->connection) . ": " . mysql_error($this->connection);
            throw new DbException($msg);
        }           
    } catch (DbException $e) {
        echo "<br><br>Caught!<br><br>";
        if(strstr($e->getMessage(), 'MySQL server has gone away')) {
            $this->connection = mysql_connect("localhost", "root", "");
            mysql_select_db("crawler1", $this->connection);
        }
    }
}

So once a query fails to execute, the script will skip it but will make sure the connection is re-established.

However, my web crawler crashes when files such as .jpg, .bmp, .pdf, etc. are encountered. Is there a way to skip URLs containing these extensions? I am using preg_match and have specified pdf and doc to match, but I want the function to skip all links containing extensions such as mp3, pdf, etc. Is this possible?
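One possible approach (a sketch; the helper name and the exact extension list are mine): test the URL's path against a single case-insensitive alternation before fetching it:

```php
<?php
// Hypothetical helper: returns true if the URL's path ends in a
// file extension we don't want to crawl (images, documents, audio).
function should_skip_url($url) {
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // unparsable URL or no path component
    }
    return preg_match('/\.(jpe?g|bmp|gif|png|pdf|docx?|mp3|zip)$/i', $path) === 1;
}
```

Calling this before queueing a link would skip e.g. http://example.edu/paper.pdf while still following .html pages; extend the alternation with whatever other extensions you meet.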

  • If your db connection is closing it would be from 1 of 2 reasons: 1) your code is closing it. 2) Your system has some major problem. I've never seen this reconnect strategy used as I've never seen a situation where it should be required. Instead of reconnecting in your catch block, try logging the exception details and debug the problem from there. – Rob Apodaca Jan 01 '12 at 17:50