I am converting a fixed-width file to a delimited file ('|' delimiter) using the pandas read_fwf method. My input file ("infile.txt") is around 16 GB with 9.9 million records. While creating the DataFrame it occupies almost three times that much memory (around 48 GB) before the output file is written. Can someone help me improve the logic below and shed some light on where this extra memory comes from? (I know seq_id, fname and loadDatetime will occupy some space, but that should only be a couple of GB.)
Note: I am processing multiple files of similar size in a loop, one after the other, so I have to clear the memory before the next file is processed.
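One way to see where the extra memory might come from is to compare the raw size of the input with what pandas reports for the loaded DataFrame. The sketch below is only an illustration on the three sample rows from this post (the in-memory `sample` string stands in for infile.txt). With dtype=object every cell is a separate Python string object, and each of those carries per-object overhead plus a pointer in the column, which is likely a large part of the 3x growth.

```python
import io
import pandas as pd

# Tiny stand-in for infile.txt (same layout as the sample in this post).
sample = "1234567890AAAAAAAAAA\n1234567890BBBBBBBBBB\n1234567890CCCCCCCCCC\n"
raw_bytes = len(sample.encode("utf-8"))

df = pd.read_fwf(io.StringIO(sample), colspecs=[(0, 10), (10, 20)],
                 names=["FIELD1", "FIELD2"], dtype=object, header=None)

# deep=True also counts the Python string objects behind object columns,
# not just the 8-byte pointers stored in the column arrays.
in_memory_bytes = int(df.memory_usage(deep=True).sum())
print("raw bytes:", raw_bytes, "in-memory bytes:", in_memory_bytes)
```

On a real file the same `memory_usage(deep=True)` call, without `.sum()`, gives a per-column breakdown.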
'''infile.txt'''
1234567890AAAAAAAAAA
1234567890BBBBBBBBBB
1234567890CCCCCCCCCC
'''test_layout.csv'''
FIELD_NAME,START_POS,END_POS
FIELD1,0,10
FIELD2,10,20
'''test.py'''
import datetime
import pandas as pd
import csv
from collections import OrderedDict
import gc
seq_id = 1
fname= 'infile.txt'
loadDatetime = '04/10/2018'
# Read the layout file into column boundaries and column names
boundaries, col_names = [], []
with open("test_layout.csv", "rt") as in_layout:
    for row in csv.DictReader(in_layout):
        boundaries.append((int(row['START_POS'].strip()), int(row['END_POS'].strip())))
        col_names.append(row['FIELD_NAME'].strip())
dataf = pd.read_fwf(fname, quoting=3, colspecs=boundaries, dtype=object, names=col_names)
len_df = len(dataf)
'''Used pair of key, value tuples and OrderedDict to preserve the order of the columns'''
mod_dataf = pd.DataFrame(OrderedDict((('seq_id',[seq_id]*len_df),('fname',[fname]*len_df))), dtype=object)
ldt_ser = pd.Series([loadDatetime]*len_df,name='loadDatetime', dtype=object)
dataf = pd.concat([mod_dataf, dataf],axis=1)
alldfs = [mod_dataf]
del alldfs
gc.collect()
mod_dataf = pd.DataFrame()
dataf = pd.concat([dataf,ldt_ser],axis=1)
dataf.to_csv("outfile.txt", sep='|', quoting=3, escapechar='\\' , index=False, header=False,encoding='utf-8')
''' Release Memory used by DataFrames '''
alldfs = [dataf]
del ldt_ser
del alldfs
gc.collect()
dataf = pd.DataFrame()
I used the garbage collector, deleted the DataFrames with del, and re-initialised them to empty DataFrames to clear the memory, but the total memory used by the DataFrame is still not released. Inspired by https://stackoverflow.com/a/49144260/2799214
'''OUTPUT'''
1|infile.txt|1234567890|AAAAAAAAAA|04/10/2018
1|infile.txt|1234567890|BBBBBBBBBB|04/10/2018
1|infile.txt|1234567890|CCCCCCCCCC|04/10/2018
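For comparison, here is a minimal sketch of a chunked variant of the same conversion, assuming the extra columns can be added per chunk instead of building full-length lists and concatenating. The function name `fwf_to_delimited` and the `chunksize` value are hypothetical choices for illustration; `read_fwf` accepts `chunksize` and then yields DataFrames of at most that many rows, so only one chunk is held in memory at a time.

```python
import csv
import pandas as pd

def fwf_to_delimited(in_path, out_path, colspecs, col_names,
                     seq_id, load_datetime, chunksize=100_000):
    """Convert a fixed-width file to '|'-delimited output, one chunk at a time."""
    reader = pd.read_fwf(in_path, colspecs=colspecs, names=col_names,
                         dtype=object, header=None, chunksize=chunksize)
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for chunk in reader:
            # Scalars are broadcast to every row of the chunk.
            chunk.insert(0, "fname", in_path)
            chunk.insert(0, "seq_id", seq_id)
            chunk["loadDatetime"] = load_datetime
            # Writing to the open handle appends each chunk to the same file.
            chunk.to_csv(out, sep="|", quoting=csv.QUOTE_NONE,
                         escapechar="\\", index=False, header=False)
```

Called as `fwf_to_delimited("infile.txt", "outfile.txt", [(0, 10), (10, 20)], ["FIELD1", "FIELD2"], 1, "04/10/2018")`, this should produce the same column order as the output shown above, and peak memory should then depend on the chunk size, not on the file size.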