
I have (what I would consider) a massive set of plain text files, around 400GB, that are being imported into a MySQL database (InnoDB engine). The .txt files range from 2GB to 26GB in size, and each file represents a table in the database. I was given a Python script which parses the .txt files and builds SQL statements. I have a machine specifically dedicated to this task with the following specs:

  • OS - Windows 10
  • 32GB RAM
  • 4TB hard drive
  • i7 3.40 GHz processor

I want to optimize this import to be as quick and dirty as possible. I've changed the following config settings in the MySQL my.ini file based on Stack Overflow questions, the MySQL docs, and other sources:

    max_allowed_packet = 1073741824
    autocommit = 0
    net_buffer_length = 0
    foreign_key_checks = 0
    unique_checks = 0
    innodb_buffer_pool_size = 8G

(innodb_buffer_pool_size made a big difference in speed when I increased it from the default of 128M.)
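
Since autocommit, unique_checks, and foreign_key_checks are session-scoped variables, I'm not certain they are all being picked up from my.ini. For reference, here is a minimal sketch of setting them from the import connection itself instead; this is not part of the actual script, and the connection parameters are placeholders:

    import MySQLdb

    # Sketch only: disable the per-row checks for this import session.
    # host/user/passwd/db are placeholders, not my real values.
    conn = MySQLdb.connect(host="localhost", user="importer", passwd="secret", db="import_db")
    cur = conn.cursor()
    cur.execute("SET SESSION foreign_key_checks = 0")
    cur.execute("SET SESSION unique_checks = 0")
    cur.execute("SET autocommit = 0")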

Are there other settings in the config file that I missed (maybe around logging or caching) that would direct MySQL to use a significant portion of the machine's resources? Could there be another bottleneck I'm missing?

(Side note: not sure if this is related - when I start the import, the mysqld process spins up to use about 13-15% of the system's memory, but then never seems to purge it when I stop the Python script from continuing the import. I'm wondering if this is a result of messing with the logging and flush settings. Thanks in advance for any help.)

(EDIT)

Here is the relevant part of the Python script that populates the tables. The script opens one connection, builds a single multi-row INSERT for each batch of 50,000 records, and commits and closes the connection once at the end of the function. Could I remove the conn.commit() at the end of the function and let MySQL handle the committing? The comments below the while (True) are from the authors of the script; I've raised their batch size from the default of 200 to 50,000, keeping the statement under the max_allowed_packet size. (A sketch of the per-batch-commit alternative I'm considering follows the snippet.)

    conn = self.connect()

    while (True):
        #By default, we concatenate 200 inserts into a single INSERT statement.
        #a large batch size per insert improves performance, until you start hitting max_packet_size issues.
        #If you increase MySQL server's max_packet_size, you may get increased performance by increasing maxNum
        records = self.parser.nextRecords(maxNum=50000)
        if (not records):
            break

        escapedRecords = self._escapeRecords(records) #This will sanitize the records
        stringList = ["(%s)" % (", ".join(aRecord)) for aRecord in escapedRecords]

        cur = conn.cursor()
        colVals = unicode(", ".join(stringList), 'utf-8')
        exStr = exStrTemplate % (commandString, ignoreString, tableName, colNamesStr, colVals)
        #unquote NULLs
        exStr = exStr.replace("'NULL'", "NULL")
        exStr = exStr.replace("'null'", "NULL")

        try:
            cur.execute(exStr)
        except MySQLdb.Warning, e:
            LOGGER.warning(str(e))
        except MySQLdb.IntegrityError, e:
            #This is likely a primary key constraint violation; should only be hit if skipKeyViolators is False
            LOGGER.error("Error %d: %s", e.args[0], e.args[1])
        self.lastRecordIngested = self.parser.latestRecordNum
        recCheck = self._checkProgress()
        if recCheck:
            LOGGER.info("...at record %i...", recCheck)
    conn.commit()
    conn.close()
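
For comparison, here is a minimal sketch of the alternative I'm weighing: committing once per batch inside the loop instead of once at the end. This is not the authors' code; the statement-building steps above are collapsed into a hypothetical _buildInsertStatement helper just to keep the sketch short:

    # Hypothetical variant of the same method, not the authors' script.
    conn = self.connect()
    while True:
        records = self.parser.nextRecords(maxNum=50000)
        if not records:
            break
        # _buildInsertStatement is a made-up stand-in for the escaping and
        # string-formatting done in the real loop above.
        exStr = self._buildInsertStatement(records)
        cur = conn.cursor()
        cur.execute(exStr)
        conn.commit()  # one transaction per 50,000-record batch
    conn.close()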
  • There are many settings for speeding this up. One is to start more processes, one per table; second, enable innodb_file_per_table in my.cnf, set innodb_buffer_pool_size to about 80% of memory, and look at table_cache. Also remove the indexes from the table and recreate them after the insert. – Bernd Buffen May 23 '16 at 21:08
  • Is the file a CSV file? Maybe you can use `LOAD DATA INFILE` instead of a script. – Barmar May 23 '16 at 21:23
  • I'd recommend looking into the Python script also. I am not familiar with its libraries, but if it doesn't support (or isn't set up for) connection pooling, and the script opens and closes a connection for each record imported, more time can be spent just connecting/disconnecting than actually importing. – Uueerdo May 23 '16 at 21:42
  • Thanks for the responses. @BerndBuffen I enabled the innodb_file_per_table, set the buffer pool size to 20G (a little less than 80% of system memory). As far as spinning up multiple processes, do you mean innodb_buffer_pool_instances? I changed that as well from 8 to 16, and it's definitely had an impact on speed. The mysqld process only seems to max out at 15-20% of memory and 15% of CPU though. Anything I can do to increase those a bit more? Thanks for your help. – tom-g May 24 '16 at 16:03
  • Also @Uueerdo you may be right, I've included the relevant piece of the Python script. That could be contributing to the slowdown as well. Thanks. – tom-g May 24 '16 at 16:21
  • @tom-g - there are many things you can do to speed this up. The next is to recreate ALL the tables: any table created before the switch to innodb_file_per_table stores its data in the ONE big file called ibdata, and that will not change on its own, so you must recreate each table so that it uses its own file. You can do it like this: CREATE TABLE newTable LIKE yourTable; INSERT INTO newTable SELECT * FROM yourTable; -- verify the data -- DROP TABLE yourTable; RENAME TABLE newTable TO yourTable; - we can also chat or phone later if you want – Bernd Buffen May 24 '16 at 16:22
  • @tom-g I've only lightly dabbled in python (YEARS ago), but it looks like it is not doing the repeated connecting-disconnecting I suggested could be a problem. And, judging from the comments, was created by someone well enough versed with python and mysql to try to optimize the inserts. Short of rewriting it with prepared statements, which would likely require rewriting the parsing hidden in those function calls, I personally don't see anything that can be improved with the script. _(And in my experience multi-inserts like that usually have similar performance to prepared statements anyway.)_ – Uueerdo May 24 '16 at 16:28
