I have an optimization problem. I need to join a number of CSV files (all with the same structure) into a single CSV.
The structure of my folders follows a temporal order: year/month/day/hour_file_identifier.csv. Thus 16y * 365d * 24h = 140160 files, which is the maximum number of files I can end up with as a result. Each folder indexed by day can contain an arbitrary number of files (which I am also generating with a Python script). Since the total number of files would exceed the maximum allowed on an ext4 filesystem, every N iterations I need to join all the files into hour-indexed files to avoid that problem.
So, the input of what I call the join_routine is the set of folders indexed by time, for example:
2001/9/3/
2002/8/4/
Each of these folders can contain a variable number of files:
2001/9/3/
    1-01.csv
    1-02.csv
    2-01.csv
2002/8/4/
    1-01.csv
    2-01.csv
    3-01.csv
    3-02.csv
The result of the join_routine should be:
2001/9/3/
    1-joined.csv
    2-joined.csv
2002/8/4/
    1-joined.csv
    2-joined.csv
    3-joined.csv
To do that I developed the following code:
import glob
import os

import pandas as pd

def join_routine():
    # print('JOIN ROUTINE')
    directory_list = [x.replace('\\', '/') for x in glob.glob('data/csv/plays/*/*/*/')]
    for directory in directory_list:
        for hour in range(0, 13):
            # all files of this directory that belong to the same hour
            file_list = [x.replace('\\', '/') for x in glob.glob(directory + str(hour) + '-*.csv')]
            if len(file_list) > 0:
                df = read_csv_list(file_list)
                df.to_csv(directory + str(hour) + '-joined.csv', index=False)
                # remove the originals, keeping only the joined file
                for file in [x for x in file_list if x != directory + str(hour) + '-joined.csv']:
                    os.remove(file)

def read_csv_list(file_list):
    df_list = []
    # with progressbar.ProgressBar(max_value=len(file_list)) as bar:
    #     i = 0
    for file in file_list:
        df = pd.read_csv(file)
        df_list.append(df)
        # i = i + 1
        # bar.update(i)
    return pd.concat(df_list, axis=0, ignore_index=True)
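Since all the CSV files share the same structure, I have also been wondering whether pandas is doing unnecessary work here: the join is really just appending rows, so a plain text concatenation that keeps the first header and skips the others might avoid the parse/serialise overhead entirely. This is only an untested sketch (join_hour_plain is just a placeholder name, and it assumes every file ends with a newline), not something I have benchmarked:

import glob
import os
import shutil

def join_hour_plain(directory, hour):
    # hypothetical helper: concatenate '<hour>-*.csv' into '<hour>-joined.csv'
    # by copying raw lines instead of parsing them with pandas
    file_list = sorted(
        x.replace('\\', '/')
        for x in glob.glob(directory + str(hour) + '-*.csv')
        if not x.endswith('-joined.csv')  # never re-read a previous output
    )
    if not file_list:
        return
    with open(directory + str(hour) + '-joined.csv', 'w') as out:
        for i, path in enumerate(file_list):
            with open(path) as f:
                if i > 0:
                    f.readline()  # drop the header of every file after the first
                shutil.copyfileobj(f, out)  # raw copy of the remaining lines
    for path in file_list:
        os.remove(path)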
The join_routine function above handles the join of each folder in a single process. I am wondering if there is a better and, more importantly, a quicker way of doing this. The join_routine is taking more than 10x longer than the file creation (which is done on a parallel process pool of 16 workers). I need to run the join_routine 21 times, and at this pace it is going to take more than a week, which is not feasible.
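One thing I have been considering, but have not tested yet, is reusing the same kind of 16-worker process pool for the join and mapping over the directories instead of looping over them. Something like this sketch, where join_directory is just the inner loop of join_routine pulled out as a helper (it calls the read_csv_list defined above):

import glob
import os
from multiprocessing import Pool

def join_directory(directory):
    # per-directory work: the body of the outer loop of join_routine above
    for hour in range(0, 13):
        file_list = [x.replace('\\', '/') for x in glob.glob(directory + str(hour) + '-*.csv')]
        if len(file_list) > 0:
            df = read_csv_list(file_list)
            df.to_csv(directory + str(hour) + '-joined.csv', index=False)
            for file in [x for x in file_list if x != directory + str(hour) + '-joined.csv']:
                os.remove(file)

def join_routine_parallel():
    directory_list = [x.replace('\\', '/') for x in glob.glob('data/csv/plays/*/*/*/')]
    with Pool(16) as pool:  # same worker count as the file-creation pool
        pool.map(join_directory, directory_list)

However, I am not sure whether the bottleneck is CPU (pandas parsing) or disk I/O, in which case more processes might not help much.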
Any ideas?