
So, I am currently dealing with a large amount of data (62 files, ~150 MB each, with ~3.5 million rows each) and I need to upload these flat files to a MySQL database, so I used the pandas library to read through the .csv files and make one dataframe containing all the data.

About the data:

The data is split into multiple files because each file contains info about a particular set of locations (coordinates) on a different day. For example, one file contains data for about 3.5 million coordinates (these are fixed and common across all the files) on a particular day, and I have 62 such files, i.e. data for 62 different dates.
I want to see all this data in one table, i.e. I want to add the extra columns (the values for the different dates, matched to the list of locations) into the table, while the columns that are common (the locations/coordinates) appear only once. An illustration of how I want the data in the table is present here, and a rough sketch is also given below.
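Roughly, the table I am after has this shape (a toy sketch with made-up values just to show the layout; the real files each have lon, lat, country and an avg column):

import pandas as pd

# Made-up toy values only, to show the shape I want: one row per
# (lon, lat, country) location and one "avg" column per date/file.
wanted = pd.DataFrame({
    "lon":     [77.10, 77.20],
    "lat":     [28.50, 28.60],
    "country": ["IN", "IN"],
    "avg0":    [1.2, 3.4],   # values from the file for date 0
    "avg1":    [2.1, 4.3],   # values from the file for date 1
    # ...one more "avg" column per remaining file/date, 62 in total
})
print(wanted)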

So far my code looks like this:

#relevant imports:
import os
import MySQLdb as sql
import numpy as np
from functools import reduce
from sqlalchemy import create_engine
import glob
import pandas as pd

#Accessing database:
engine = create_engine("mysql://root:PASSWORD@localhost/DBNAME")
con = engine.connect()

#Directory change:
path = "C:\\Users\\DELL\\Desktop\\DataAnalysis-Internship 18'\\extracted data\\"
os.chdir(path)

#listing file names:
files = glob.glob("IN.201*.csv")
num_files = len(files)

#DATAFRAMES:
#initialising the first dataframe and renaming its 'avg' column to 'avg0':
df = pd.read_csv(path + files[0])
df.rename(columns={"avg": "avg" + str(0)}, inplace=True)

#merging the remaining files in one at a time on the common location columns
#(counter initialised before the loop so each file gets its own 'avg<i>' column):
i = 1
for file in files[1:]:
    df_next = pd.read_csv(path + file)
    df_next.rename(columns={"avg": "avg" + str(i)}, inplace=True)
    df = df.merge(df_next, on=['lon', 'lat', 'country'])
    i = i + 1

#writing the combined dataframe to the database in chunks:
df.to_sql(con=con, name='final_table', if_exists='replace', chunksize=10000)

con.close()

When I run this code, though, my PC starts executing it, performance slowly degrades, and eventually the machine starts hanging. Probably too much memory is being used up, maybe because I am building a single dataframe out of many files and then transferring it to the database (I am not sure).

What should I do now to upload all these files to my table in the way I want? Is there a more optimized way? I thought of merging 5 files at a time into a dataframe and then using if_exists='append' in to_sql to combine these partial results into one table, instead of making one huge dataframe and then transferring it, but I am not sure whether 'append' will join the tables the way I want. A rough sketch of the idea is below.
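Something along these lines (completely untested; the batch size of 5 is arbitrary, and it re-uses path, files, con, pd and reduce from the code above):

#untested sketch of the batched idea: merge 5 files at a time and append the result
batch_size = 5
for start in range(0, len(files), batch_size):
    batch = files[start:start + batch_size]
    # read each file in the batch and rename its 'avg' column to 'avg<j>'
    frames = []
    for j, file in enumerate(batch, start=start):
        part = pd.read_csv(path + file)
        part.rename(columns={"avg": "avg" + str(j)}, inplace=True)
        frames.append(part)
    # merge the batch on the common location columns
    merged = reduce(lambda left, right: left.merge(right, on=['lon', 'lat', 'country']), frames)
    # 'append' adds ROWS to the table, so I am not sure this gives the
    # wide (one column per date) layout I want -- that is my question.
    merged.to_sql(con=con, name='final_table', if_exists='append', chunksize=10000)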

Hrithik Diwakar

1 Answer


My guess is that the problem arises from df = df.merge occurring inside the loop.

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying. Source: Why does concatenation of DataFrames get exponentially slower?

Try adding all the dataframes to a list df_list and then concatenating them once at the end:

import pandas as pd

df_list = []
df_list.append(pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6])))
df_list.append(pd.DataFrame(dict(c=[1, 2, 3], d=[4, 5, 6])))

pd.concat(df_list, axis=1)
Out[]:
   a  b  c  d
0  1  4  1  4
1  2  5  2  5
2  3  6  3  6
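Applied to your files, the same pattern would look roughly like this (an untested sketch: I'm assuming each CSV only contains the lon, lat, country and avg columns you merge and rename on, and I set the location columns as the index so that concat with axis=1 aligns the rows). Note that this still holds all the data in memory at once; it just avoids the repeated copying caused by merging inside the loop.

import glob
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql://root:PASSWORD@localhost/DBNAME")
files = glob.glob("IN.201*.csv")

df_list = []
for i, file in enumerate(files):
    part = pd.read_csv(file)
    part.rename(columns={"avg": "avg" + str(i)}, inplace=True)
    # use the common location columns as the index so that concat
    # aligns rows instead of duplicating lon/lat/country
    part.set_index(['lon', 'lat', 'country'], inplace=True)
    df_list.append(part)

# one concat at the end instead of merging inside the loop
df = pd.concat(df_list, axis=1)
df.to_sql(con=engine, name='final_table', if_exists='replace', chunksize=10000)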
Dillon
  • Won't storing the dataframes in a list, instead of reading each frame once and merging it straight away in the for loop, be heavy on memory? – Hrithik Diwakar Jun 13 '18 at 12:57
  • @hrithik-diwakar I'm not sure, try it – Dillon Jun 13 '18 at 13:05
  • @hrithik-diwakar `pd.concat(*df, ignore_index=True)` i.e. `*df` – Dillon Jun 13 '18 at 13:13
  • When I use concat, it produces duplicate columns, whereas I want the columns that are common between the two tables to appear only once, with just the extra columns added, like merge does. – Hrithik Diwakar Jun 14 '18 at 05:23
  • @HrithikDiwakar I've updated the answer to include a more thorough solution. Note the use of `axis=1` to account for column-wise joining. – Dillon Jun 14 '18 at 06:47
  • Thanks for the info, it helped, but collecting everything into one single dataframe is still a problem. I checked the memory usage of one of my dataframes and it was ~400 MB; for 60 such frames that would be ~400 MB × 60, which is around 24 GB, and I have 8 GB of RAM on my desktop. At the moment I am making a dataframe from 2 files at a time and sending it to SQL, planning to merge/join all 31 tables in MySQL. Hopefully it will work, but it is awfully slow. I am still open to any advice on how to do this quicker. – Hrithik Diwakar Jun 14 '18 at 11:15