I'm using Python 2.7 with pandas 0.20.2 to read multiple CSV files into a DataFrame (one at a time), create lists from selected Series within the DataFrame, and then dump that information into a MySQL database.
The curious behavior is that the first loop works fine (I've tried various files, and the first one always works), but the subsequent write to MySQL hangs, with significant memory leakage, until it ends in a segmentation fault (core dumped) error.
I'll try to give a minimal example here:
import pandas as pd
import MySQLdb

file_list = ['file1.csv', 'file2.csv']

db = MySQLdb.connect(...)
with db as c:
    c.execute("USE my_db")
    for f in file_list:
        df = pd.read_csv(f, parse_dates=['date_time'])
        # Pull the columns I need out of the DataFrame as plain Python lists
        sql_dict = {
            'date_time': df['date_time'].dt.strftime('%Y-%m-%d %H:%M:%S').tolist(),
            'num': df['num'].tolist()
        }
        columns = sql_dict.keys()
        values = sql_dict.values()
        # One "(%s,%s)" placeholder group, repeated once per row in the file
        join_string = "(" + ",".join("%s" for c in columns) + ")"
        query = "REPLACE INTO std_ints ({}) VALUES ".format(", ".join(columns)) + ",".join(join_string for _ in values[0])
        # Flatten the column lists into a single row-ordered parameter list
        flat_vals = [i for s in zip(*values) for i in s]
        c.execute(query, flat_vals)
        db.commit()
db.close()
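To make the query construction concrete, here's a standalone sketch of what those lines build, with made-up values standing in for a hypothetical two-row file (the real column order depends on Python 2.7's arbitrary dict ordering):

# Illustration only: fabricated sample values for a two-row file
sql_dict = {
    'date_time': ['2017-06-01 00:00:00', '2017-06-01 00:01:00'],
    'num': [1, 2],
}
columns = sql_dict.keys()
values = sql_dict.values()
join_string = "(" + ",".join("%s" for c in columns) + ")"
query = "REPLACE INTO std_ints ({}) VALUES ".format(", ".join(columns)) + ",".join(join_string for _ in values[0])
flat_vals = [i for s in zip(*values) for i in s]
print(query)      # e.g. REPLACE INTO std_ints (date_time, num) VALUES (%s,%s),(%s,%s)
print(flat_vals)  # e.g. ['2017-06-01 00:00:00', 1, '2017-06-01 00:01:00', 2]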
The first loop executes flawlessly, and the second iteration then runs fine until the c.execute(query, flat_vals) line, where it hangs and never completes.
I've tried countless variations of where the .commit() goes and of closing or not closing db, all to no avail.
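For example, one of those variations committed once after the whole loop and closed the cursor explicitly, roughly like this (a sketch, with the same elided connection arguments as above and the column order hard-coded):

import pandas as pd
import MySQLdb

file_list = ['file1.csv', 'file2.csv']

db = MySQLdb.connect(...)            # same elided connection arguments as above
cur = db.cursor()
cur.execute("USE my_db")
for f in file_list:
    df = pd.read_csv(f, parse_dates=['date_time'])
    date_strs = df['date_time'].dt.strftime('%Y-%m-%d %H:%M:%S').tolist()
    nums = df['num'].tolist()
    placeholders = ",".join("(%s,%s)" for _ in date_strs)
    query = "REPLACE INTO std_ints (date_time, num) VALUES " + placeholders
    flat_vals = [v for row in zip(date_strs, nums) for v in row]
    cur.execute(query, flat_vals)    # still hangs here on the second file
db.commit()                          # single commit after the whole loop
cur.close()
db.close()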
I can't tell whether this is a pandas problem or a MySQL problem, but both DataFrames load quickly and take up exactly the same amount of memory, so I'm thinking it's a MySQL issue.
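This is roughly how I compared the two (a sketch, not the exact code I ran):

import pandas as pd

# Rough sanity check on DataFrame size; both files come back with the same
# shape and essentially the same memory footprint.
for f in ['file1.csv', 'file2.csv']:
    df = pd.read_csv(f, parse_dates=['date_time'])
    print("%s %s %d bytes" % (f, df.shape, df.memory_usage(deep=True).sum()))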
I also read this thread about multithreading and tried opening a new connection for each iteration over file_list, also to no avail.
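Concretely, that attempt looked roughly like this (again a sketch, with the connection arguments elided):

import pandas as pd
import MySQLdb

file_list = ['file1.csv', 'file2.csv']

for f in file_list:
    db = MySQLdb.connect(...)        # brand-new connection for every file
    cur = db.cursor()
    cur.execute("USE my_db")
    df = pd.read_csv(f, parse_dates=['date_time'])
    date_strs = df['date_time'].dt.strftime('%Y-%m-%d %H:%M:%S').tolist()
    nums = df['num'].tolist()
    placeholders = ",".join("(%s,%s)" for _ in date_strs)
    query = "REPLACE INTO std_ints (date_time, num) VALUES " + placeholders
    flat_vals = [v for row in zip(date_strs, nums) for v in row]
    cur.execute(query, flat_vals)    # same hang on the second file
    db.commit()
    cur.close()
    db.close()                       # connection fully torn down before the next file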
Any help would be appreciated.