
I have a CSV input file with approx. 4 million records. The insert has been running for over 2 hours and still has not finished, and the database is still empty.

Any suggestions on how to actually insert the values (using INSERT INTO) faster, for example by breaking the insert into chunks?

I'm pretty new to python.

  • csv file example
43293,cancelled,1,0.0,
1049007,cancelled,1,0.0,
438255,live,1,0.0,classA
1007255,xpto,1,0.0,
  • python script
import csv
import pymysql

def csv_to_DB(xing_csv_input, db_opts):
    print("Inserting csv file {} to database {}".format(xing_csv_input, db_opts['host']))
    conn = pymysql.connect(**db_opts)
    cur = conn.cursor()
    try:
        with open(xing_csv_input, newline='') as csvfile:
            csv_data = csv.reader(csvfile, delimiter=',', quotechar='"')
            for row in csv_data:
                insert_str = "INSERT INTO table_x (ID, desc, desc_version, val, class) VALUES (%s, %s, %s, %s, %s)"
                cur.execute(insert_str, row)
        conn.commit()
    finally:
        conn.close()

UPDATE: Thanks for all the input. As suggested, I tried a counter to insert in batches of 100, on a smaller CSV data set (1000 lines). The problem now is that only 100 records are inserted, even though the counter reaches 100 ten times.

code change:

def csv_to_DB(xing_csv_input, db_opts):
    print("Inserting csv file {} to database {}".format(xing_csv_input, db_opts['host']))
    conn = pymysql.connect(**db_opts)
    cur = conn.cursor()
    count = 0
    try:
        with open(xing_csv_input, newline='') as csvfile:
            csv_data = csv.reader(csvfile, delimiter=',', quotechar='"')
            for row in csv_data:
                count += 1
                print(count)
                insert_str = "INSERT INTO table_x (ID, desc, desc_version, val, class) VALUES (%s, %s, %s, %s, %s)"

                if count >= 100:
                    cur.execute(insert_str, row)
                    print("count100")
                    conn.commit()
                    count = 0

                if not row:
                    cur.execute(insert_str, row)
                    conn.commit()
    finally:
        conn.close()
JonyD
  • The database is empty because it has not reached the `commit()` yet. You could move that into the loop but that will only slow things down further. Perhaps you could introduce a counter and commit every 1000 records? – Daniel Roseman Jun 17 '19 at 10:58
  • Run the code on a small subset of the data to test if it works as intended. – Sam Jun 17 '19 at 11:00
  • Is the database in another location over a slow network connection? – cardamom Jun 17 '19 at 11:33
  • This is the fastest method to get a CSV file into MySQL: `LOAD DATA INFILE 'file.csv' INTO TABLE MyTable`. My 90 million rows take about 25 minutes. – MEdwin Jun 17 '19 at 11:51
  • @JonyD In your `finally:`, should you have a final `conn.commit()` before you `conn.close()`, to get the last block of data recorded? – Wilson Hauck Jun 17 '19 at 14:20
  • @Wilson Hauck I think I don't need it there. `finally` will always be executed. If by some error there is nothing to commit, a `commit` in `finally` would make no sense. Not sure though. I'm new to this – JonyD Jun 17 '19 at 14:27
  • @JonyD You will figure it out. Good Luck – Wilson Hauck Jun 17 '19 at 14:29

2 Answers


There are many ways to optimise this insert. Here are some ideas:

  1. You have a for loop over the entire dataset. You can do a commit() only every 100 rows or so (see the first example below).
  2. You can insert many rows with a single INSERT statement.
  3. You can combine the two and issue one multi-row insert for every 100 rows of your CSV.
  4. If Python is not a requirement, you can do it directly in MySQL (e.g. with LOAD DATA INFILE) as explained here. (If you must do it using Python, you can still prepare that statement in Python and avoid looping through the file manually; a sketch of driving it from pymysql appears after the examples below.)

Examples:
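
For number 1, a minimal sketch of the idea (this reuses `conn`, `cur` and the `csv_data` reader exactly as in your original function; the batch size of 100 is just an example, and `desc` is backtick-quoted because it is a reserved word in MySQL):

insert_str = "INSERT INTO table_x (ID, `desc`, desc_version, val, class) VALUES (%s, %s, %s, %s, %s)"
count = 0
for row in csv_data:
    cur.execute(insert_str, row)   # still one execute per row...
    count += 1
    if count % 100 == 0:           # ...but commit only once per 100 rows
        conn.commit()
conn.commit()                      # commit whatever is left after the loop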

For numbers 2 and 3 in the list, the code will have the following structure:

def csv_to_DB(xing_csv_input, db_opts):
    print("Inserting csv file {} to database {}".format(xing_csv_input, db_opts['host']))
    conn = pymysql.connect(**db_opts)
    cur = conn.cursor()
    # `desc` is a reserved word in MySQL, hence the backticks
    insert_str = "INSERT INTO table_x (ID, `desc`, desc_version, val, class) VALUES "
    template = "(%s, %s, %s, %s, %s)"
    try:
        with open(xing_csv_input, newline='') as csvfile:
            csv_data = csv.reader(csvfile, delimiter=',', quotechar='"')
            to_insert = []
            for row in csv_data:
                to_insert.append(tuple(row))
                if len(to_insert) == 100:
                    # one multi-row INSERT ... VALUES (...), (...), ... per 100 rows,
                    # with the row values passed as parameters so they are escaped properly
                    query = insert_str + ", ".join([template] * len(to_insert))
                    cur.execute(query, [val for r in to_insert for val in r])
                    conn.commit()
                    to_insert = []
            if to_insert:  # insert whatever is left after the last full batch
                query = insert_str + ", ".join([template] * len(to_insert))
                cur.execute(query, [val for r in to_insert for val in r])
                conn.commit()
    finally:
        conn.close()
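
For number 4, a rough sketch of driving LOAD DATA from Python with pymysql (assumptions: the CSV file sits on the machine running the script, the server allows LOCAL INFILE, and the function name is just illustrative):

import pymysql

def load_csv_with_load_data(xing_csv_input, db_opts):
    # local_infile must be enabled on the client connection
    # (and the server must allow it) for LOAD DATA LOCAL INFILE to work
    conn = pymysql.connect(local_infile=True, **db_opts)
    cur = conn.cursor()
    try:
        # the server parses the CSV itself, which is usually much faster
        # than sending row-by-row INSERTs from the client
        cur.execute(
            "LOAD DATA LOCAL INFILE %s INTO TABLE table_x "
            "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
            "LINES TERMINATED BY '\\n' "
            "(ID, `desc`, desc_version, val, class)",
            (xing_csv_input,))
        conn.commit()
    finally:
        conn.close()

This is the same LOAD DATA INFILE route mentioned in the comments, just issued from Python.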
Mateo Torres
  • On point 2, I tried it, but it does not seem possible using it inside the `with open(xing_csv_input, newline='') as csvfile:` and with the `row` from the `for` loop (not enough arguments for format string). It does not seem able to fill in the values from several rows. Also, here it says "using parameters does not work well in this case, and it is usually a good idea to execute each statement on its own" - https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor-execute.html – JonyD Jun 17 '19 at 12:56
  • For that particular scenario you should look into building one string with many rows [here's a good read on efficient string concatenation](https://waymoot.org/home/python_string/). The idea would be to iterate to (say) 100 rows while building one insert SQL statement, and then run that insert followed by a `commit()` call. Does it make sense? – Mateo Torres Jun 17 '19 at 13:10
  • Makes total sense and it's my goal to make it efficient. Problem is the use of `cur.execute(cmd, args)` where the args are a row from the csv reader – JonyD Jun 17 '19 at 13:14
  • I edited my answer to provide an example of the structure that solution number 2 would have. Again, depending on the number of rows in your file, there might be a better solution. – Mateo Torres Jun 17 '19 at 13:26
  • The last column in the CSV file sometimes has values and most of the time has `''` (empty string, as displayed in the csv example). This causes an error in line `query = insert_str + '\n'.join(template % row)`: _"TypeError: not enough arguments for format string"_. I guess because the last `%s` is `''` (or even has no assigned value). Would `"... {}...".format(varX)` be a better solution? – JonyD Jun 17 '19 at 16:59
  • plus, I think there is a typo in your code. I think it should be `query = insert_str.join(template % (row)).join('\n')` instead of `query = insert_str + '\n'.join(template % row)` – JonyD Jun 17 '19 at 17:43
  • And there should be an `if count != 0 and ` in the if to skip the first record (when count is 0) – JonyD Jun 17 '19 at 17:51
  • There were a couple of mistakes in my solution. The `join` should be applied to the list instead. I don't think there should be any skip. If the statement is `insert_str.join(template % row).join('\n')` I believe the code will simply fail. Have a look at python's [join](https://docs.python.org/3.7/library/stdtypes.html#str.join) method – Mateo Torres Jun 17 '19 at 23:09
  • I'm still having problems with the string formatting. I created another [question](https://stackoverflow.com/questions/56645469/python-string-formatting-with-percentage-typeerror-not-enough-arguments-for-f) since it's out of scope from this one. I'm accepting your answer. Thanks for the help – JonyD Jun 18 '19 at 09:07
  • I edited the solution once more, the expression should have iterated over the `to_insert` list. Hope it works – Mateo Torres Jun 18 '19 at 11:06

Here, try this snippet using executemany() and let me know if it works.

# assumes `conn` and `cur` exist as in the question's function
with open(xing_csv_input, newline='') as csvfile:
    # pass the reader straight to executemany(); no need to
    # materialise all 4 million rows in memory first
    csv_data = csv.reader(csvfile, delimiter=',', quotechar='"')
    # `desc` is a reserved word in MySQL, hence the backticks
    query = "INSERT INTO table_x (ID, `desc`, desc_version, val, class) VALUES (%s, %s, %s, %s, %s)"
    try:
        cur.executemany(query, csv_data)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
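
If a single executemany() over all 4 million rows turns out to be too heavy for one transaction, a possible variation (the chunk size of 10,000 and the function name are just illustrative) is to feed it in slices and commit once per chunk:

import csv
import itertools
import pymysql

def csv_to_db_chunked(xing_csv_input, db_opts, chunk_size=10000):
    conn = pymysql.connect(**db_opts)
    cur = conn.cursor()
    # `desc` is backtick-quoted because it is a reserved word in MySQL
    query = "INSERT INTO table_x (ID, `desc`, desc_version, val, class) VALUES (%s, %s, %s, %s, %s)"
    try:
        with open(xing_csv_input, newline='') as csvfile:
            reader = csv.reader(csvfile, delimiter=',', quotechar='"')
            while True:
                chunk = list(itertools.islice(reader, chunk_size))
                if not chunk:
                    break
                cur.executemany(query, chunk)  # multi-row INSERTs for this chunk
                conn.commit()                  # keep each transaction small
    finally:
        conn.close()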
Underoos