
Note:

The PHP snippets are only there to illustrate the schema; the question itself is primarily about MySQL.

The problem:

I am struggling to speed up the insertion of data into a database. The amount of data ranges from a few thousand rows to several million, and it needs to be inserted as quickly as possible.

This is the background:

I have created a small library which loads a file containing details of a file system. The file is read line by line as CSV; the columns are then processed and inserted into a generated table.
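
For illustration, this is roughly the shape of the read/insert loop (a simplified sketch, not the actual library code; the CSV column order and the casts are assumptions):

// Simplified sketch only: read the CSV line by line and insert each row.
// $this->conn is the same Doctrine\DBAL\Connection used in the snippets below.
$stmt = $this->conn->prepare(sprintf(
    'INSERT INTO %s (path, name, pathname, atime, mtime, is_dir, length) VALUES (?, ?, ?, ?, ?, ?, ?)',
    $this->table
));

$handle = fopen($csvPath, 'r');

while (($row = fgetcsv($handle)) !== false) {
    // The real code processes the columns here; this just maps them by position.
    list($path, $name, $pathname, $atime, $mtime, $isDir, $length) = $row;

    $stmt->execute(array($path, $name, $pathname, (int) $atime, (int) $mtime, (int) $isDir, (int) $length));
}

fclose($handle);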

Creation of the table:

The table is created using Doctrine 2, but the code should be fairly self-explanatory.

$schema = new Schema();
$table = $schema->createTable($tableName);
$table->addOption('engine', 'MyISAM');

$table->addColumn('id',       'integer', array('unsigned' => true, 'autoincrement' => true));
$table->addColumn('path',     'string',  array('length' => 255));
$table->addColumn('name',     'string',  array('length' => 255));
$table->addColumn('pathname', 'string',  array('length' => 255));
$table->addColumn('atime',    'integer');
$table->addColumn('mtime',    'integer');
$table->addColumn('is_dir',   'boolean');
$table->addColumn('length',   'integer');

$table->setPrimaryKey(array('id'));

$queries = $schema->toSql($this->conn->getDatabasePlatform());

foreach($queries as $query) {
    $this->conn->executeQuery($query);
}

First attempt:

Before I optimized the insert process, the table creation also included the following indexes. These are the indexes I ultimately need on the table.

$table->addUniqueIndex(array('pathname'), 'IDX_PATHNAME');

$table->addIndex(array('path'),   'IDX_PATH');
$table->addIndex(array('name'),   'IDX_NAME');
$table->addIndex(array('atime'),  'IDX_ACCESSED_TIME');
$table->addIndex(array('mtime'),  'IDX_MODIFIED_TIME');
$table->addIndex(array('is_dir'), 'IDX_IS_DIR');
$table->addIndex(array('length'), 'IDX_LENGTH');

$queries = $schema->toSql($this->conn->getDatabasePlatform());

foreach($queries as $query) {
    $this->conn->executeQuery($query);
}

$stmt = $this->conn->prepare(sprintf('CREATE FULLTEXT INDEX IDX_PATHNAME_FULLTEXT ON %s (pathname)', $this->table));
$stmt->execute();

When inserting anything past 200,000 rows, MySQL would slow to a crawl, presumably because every row also had to update each of the secondary indexes plus the FULLTEXT index, and PHP would sometimes even run out of memory (I'm not sure why).

Indexes after insert:

I read here (Is it better to create an index before filling a table with data, or after the data is in place?) that creating the indexes after the inserts makes for a faster insert.

Luckily for me, once the data is inserted it never needs to be modified again (aside from dropping the table when it is no longer needed), so adding the indexes after the insert suits me very well.

$diff = new TableDiff($this->table);
$indexes = array(
    new Index('IDX_PATHNAME',      array('pathname'), true),
    new Index('IDX_PATH',          array('path')),
    new Index('IDX_NAME',          array('name')),
    new Index('IDX_ACCESSED_TIME', array('atime')),
    new Index('IDX_MODIFIED_TIME', array('mtime')),
    new Index('IDX_IS_DIR',        array('is_dir')),
    new Index('IDX_LENGTH',        array('length')),
);

$diff->addedIndexes = $indexes;

$this->schemaManager->alterTable($diff);

$stmt = $this->conn->prepare(sprintf('CREATE FULLTEXT INDEX IDX_PATHNAME_FULLTEXT ON %s (pathname)', $this->table));
$stmt->execute();

Glimmer of hope:

I tested this and the insert process was blindingly fast, even with 2+ million rows; PHP memory usage never exceeded 2.5%, and MySQL only used around 8% (of 4 GB of RAM, on 64-bit Xubuntu).

The brick wall:

Once I added the step that alters the table to add the indexes after the insert, my efforts were defeated. Although it ran to completion (it didn't crash or freeze, which is a plus), the whole process still took roughly the same time as when the indexes were in place from the start.

I am now looking for ways to optimize the schema or insert order in the hope of indexing the table more quickly.

1 Answer

Just use: LOAD DATA INFILE

Read more here: http://dev.mysql.com/doc/refman/5.1/en/load-data.html

It is the fastest way to load CSV data into MySQL.
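
For example, something along these lines, reusing the $this->conn and $this->table from your snippets (a rough sketch; the file path, field/line terminators and column list are assumptions about your CSV format, and LOCAL must be allowed by both client and server):

// Rough sketch: bulk-load the already-processed CSV in a single statement.
$sql = sprintf(
    "LOAD DATA LOCAL INFILE %s
     INTO TABLE %s
     FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
     LINES TERMINATED BY '\\n'
     (path, name, pathname, atime, mtime, is_dir, length)",
    $this->conn->quote('/path/to/processed.csv'),
    $this->table
);

$this->conn->executeQuery($sql);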

  • The file requires processing prior to insert. Also the insert itself is no longer the problem, but the adding of the indexes. – Flosculus Jun 16 '14 at 11:38