
I'm using Python 2.7 with pandas 0.20.2 to read multiple CSV files into a DataFrame (one at a time), create lists from selected Series within the DataFrame, and then dump that information into a MySQL database.

The curious behavior is that the first loop iteration works fine (I've tried various files, and the first one always works well), but the subsequent write to MySQL hangs, leaking memory until the process dies with a segmentation fault (core dumped).

I'll try to give a minimal example here:

import pandas as pd
import MySQLdb

file_list = ['file1.csv', 'file2.csv']

# Single connection reused for every file
db = MySQLdb.connect(...)

with db as c:
    c.execute("USE my_db")

    for f in file_list:
        df = pd.read_csv(f, parse_dates=['date_time'])

        # Pull the relevant Series out of the DataFrame as plain Python lists
        sql_dict = {
            'date_time': df['date_time'].dt.strftime('%Y-%m-%d %H:%M:%S').tolist(),
            'num': df['num'].tolist()
        }

        columns = sql_dict.keys()
        values = sql_dict.values()

        # Build one "(%s,%s)" placeholder group per row, then flatten the
        # values row-wise so they line up with the placeholders
        join_string = "(" + ",".join("%s" for c in columns) + ")"
        query = "REPLACE INTO std_ints ({}) VALUES ".format(", ".join(columns)) + ",".join(join_string for _ in values[0])
        flat_vals = [i for s in zip(*values) for i in s]

        c.execute(query, flat_vals)
        db.commit()

db.close()
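
For illustration, here's a minimal sketch of what that string building produces for a made-up two-row file (the column order depends on whatever order the dict's .keys() happens to return in Python 2.7):

# Sketch with made-up data: two rows, two columns, hypothetical key order
columns = ['date_time', 'num']
values = [['2017-01-01 00:00:00', '2017-01-01 00:05:00'], [1, 2]]

join_string = "(" + ",".join("%s" for c in columns) + ")"
query = "REPLACE INTO std_ints ({}) VALUES ".format(", ".join(columns)) + ",".join(join_string for _ in values[0])
flat_vals = [i for s in zip(*values) for i in s]

print(query)      # REPLACE INTO std_ints (date_time, num) VALUES (%s,%s),(%s,%s)
print(flat_vals)  # ['2017-01-01 00:00:00', 1, '2017-01-01 00:05:00', 2]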

The first loop iteration executes flawlessly, and the second runs up to the c.execute(query, flat_vals) line, where it hangs and never completes.

I've tried countless variations of .commit() and of closing or not closing db, all to no avail.

I can't tell whether this is a pandas problem or a MySQL problem, but both DataFrames are generated quickly and occupy exactly the same amount of memory, so I'm inclined to think it's a MySQL issue.

I also read this thread regarding multithreading and tried opening a new connection for each loop through file_list, also to no avail.

Any help would be appreciated.

elPastor
  • How do you install the MySQL driver? Also, check that the client library version matches the server version. Sometimes `pip uninstall` and `pip install` again helps, since it rebuilds the binary and links to the MySQL client library correctly. – tungd Sep 24 '17 at 16:16
  • @tungd - Thanks for the suggestion, I'll give it a shot if nothing else works. Given that I can insert data into the database using various data sets, I'm inclined to believe that's not the reason... but I could easily be mistaken. – elPastor Sep 24 '17 at 16:18

1 Answer


Despite my line above about "this thread regarding multithreading" not helping, it seems that thread held the answer after all.

New, working code:

import pandas as pd
import MySQLdb

file_list = ['file1.csv', 'file2.csv']

for f in file_list:
    # Open a fresh connection for each file instead of reusing one connection
    db = MySQLdb.connect(...)

    with db as c:
        c.execute("USE my_db")

        df = pd.read_csv(f, parse_dates=['date_time'])

        sql_dict = {
            'date_time': df['date_time'].dt.strftime('%Y-%m-%d %H:%M:%S').tolist(),
            'num': df['num'].tolist()
        }

        columns = sql_dict.keys()
        values = sql_dict.values()

        join_string = "(" + ",".join("%s" for c in columns) + ")"
        query = "REPLACE INTO std_ints ({}) VALUES ".format(", ".join(columns)) + ",".join(join_string for _ in values[0])
        flat_vals = [i for s in zip(*values) for i in s]

        c.execute(query, flat_vals)
        db.commit()

Note the lack of db.close() in the updated version. Frankly, I'm not sure which change was the key, but either that, moving the MySQL connection call inside the loop, or both did the trick.
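
For completeness, if you want each connection explicitly closed, here's a minimal sketch of the same per-file pattern with an explicit close (assuming the same placeholder connect arguments as above):

from contextlib import closing

import pandas as pd
import MySQLdb

file_list = ['file1.csv', 'file2.csv']

for f in file_list:
    # closing() guarantees db.close() runs when the block exits,
    # while each file still gets its own connection
    with closing(MySQLdb.connect(...)) as db:
        with db as c:
            c.execute("USE my_db")
            # ... same read_csv / query building / c.execute as above ...
            db.commit()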

elPastor