I have (what I would consider) a massive set of plain text files, around 400GB, that are being imported into a MySQL database (InnoDB engine). The .txt files range from 2GB to 26GB in size, and each file represents a table in the database. I was given a Python script which parses the .txt files and builds SQL statements. I have a machine specifically dedicated to this task with the following specs:
- OS - Windows 10
- 32GB RAM
- 4TB hard drive
- i7 3.40 GHz processor
I want to optimize this import to be as quick and dirty as possible. I've changed the following config settings in the MySQL my.ini file based on stack O questions, the MySQL docs, and other sources:
max_allowed_packet=1073741824;
autocommit=0;
net_buffer_length=0;
foreign_key_check=0;
unique_checks=0;
innodb_buffer_pool_size=8G; (this made a big difference in speed when I increased from the default of 128M)
Are there other settings in the config file that I missed (maybe around logging or caching) that would direct MySQL to use a significant portion of the machine's resources? Could there be another bottleneck I'm missing?
(Side note: not sure if this is related - when I start the import, the mysqld
process spins up to use about 13-15% of the system's memory, but then never seems to purge it when I stop the Python script from continuing the import. I'm wondering if this is a result of messing with the logging and flush settings. Thanks in advance for any help.)
(EDIT)
Here is the relevant part of the Python script that populates the tables. It appears the script is connecting, committing and closing the connection for every 50,000 records. Could I remove the conn.commit()
at the end of the function and let MySQL handle the committing? The comments below the while (true)
are from the authors of the script, and I've adjusted that number so that it won't exceed max_allowed_packet size.
conn = self.connect()
while (True):
#By default, we concatenate 200 inserts into a single INSERT statement.
#a large batch size per insert improves performance, until you start hitting max_packet_size issues.
#If you increase MySQL server's max_packet_size, you may get increased performance by increasing maxNum
records = self.parser.nextRecords(maxNum=50000)
if (not records):
break
escapedRecords = self._escapeRecords(records) #This will sanitize the records
stringList = ["(%s)" % (", ".join(aRecord)) for aRecord in escapedRecords]
cur = conn.cursor()
colVals = unicode(", ".join(stringList), 'utf-8')
exStr = exStrTemplate % (commandString, ignoreString, tableName, colNamesStr, colVals)
#unquote NULLs
exStr = exStr.replace("'NULL'", "NULL")
exStr = exStr.replace("'null'", "NULL")
try:
cur.execute(exStr)
except MySQLdb.Warning, e:
LOGGER.warning(str(e))
except MySQLdb.IntegrityError, e:
#This is likely a primary key constraint violation; should only be hit if skipKeyViolators is False
LOGGER.error("Error %d: %s", e.args[0], e.args[1])
self.lastRecordIngested = self.parser.latestRecordNum
recCheck = self._checkProgress()
if recCheck:
LOGGER.info("...at record %i...", recCheck)
conn.commit()
conn.close()