How to clear Dataframe memory in pandas?

Question

I am converting a fixed-width file to a delimited file ('|' delimiter) using the pandas read_fwf method. My input file ("infile.txt") is around 16 GB with 9.9 million records. While the DataFrame is being created it occupies almost three times that much memory (around 48 GB) before the output file is written. Can someone help me improve the logic below and shed some light on where this extra memory comes from? (I know seq_id, fname and loadDatetime will occupy some space, but that should be only a couple of GB.)

Note: I am processing multiple files of similar size in a loop, one after the other, so I have to clear the memory before the next file is read.

'''infile.txt'''

1234567890AAAAAAAAAA
1234567890BBBBBBBBBB
1234567890CCCCCCCCCC

'''test_layout.csv'''

FIELD_NAME,START_POS,END_POS
FIELD1,0,10
FIELD2,10,20

'''test.py'''

import datetime
import pandas as pd
import csv
from collections import OrderedDict
import gc
seq_id = 1
fname= 'infile.txt'
loadDatetime = '04/10/2018'
in_layout = open("test_layout.csv","rt")
reader = csv.DictReader(in_layout)
boundries, col_names = [[],[]]
for row in reader:
    boundries.append(tuple([int(str(row['START_POS']).strip()) , int(str(row['END_POS']).strip())]))
    col_names.append(str(row['FIELD_NAME']).strip())
dataf = pd.read_fwf(fname, quoting=3, colspecs = boundries, dtype = object, names = col_names)
len_df = len(dataf)
'''Used pair of key, value tuples and OrderedDict to preserve the order of the columns'''
mod_dataf = pd.DataFrame(OrderedDict((('seq_id',[seq_id]*len_df),('fname',[fname]*len_df))), dtype=object)
ldt_ser = pd.Series([loadDatetime]*len_df,name='loadDatetime', dtype=object)
dataf = pd.concat([mod_dataf, dataf],axis=1)
alldfs = [mod_dataf]
del alldfs
gc.collect()
mod_dataf = pd.DataFrame()
dataf = pd.concat([dataf,ldt_ser],axis=1)
dataf.to_csv("outfile.txt", sep='|', quoting=3, escapechar='\\' , index=False, header=False,encoding='utf-8')
''' Release Memory used by DataFrames '''
alldfs = [dataf]
del ldt_ser
del alldfs
gc.collect()
dataf = pd.DataFrame()

I called the garbage collector, deleted the DataFrame and re-initialised it to clear the memory it used, but the memory is still not fully released. Inspired by https://stackoverflow.com/a/49144260/2799214

'''OUTPUT'''

1|infile.txt|1234567890|AAAAAAAAAA|04/10/2018
1|infile.txt|1234567890|BBBBBBBBBB|04/10/2018
1|infile.txt|1234567890|CCCCCCCCCC|04/10/2018

Gilles Criton · Accepted Answer · 2019-01-30T20:59:17.280

I had the same problem as you when using https://stackoverflow.com/a/49144260/2799214. I found a solution using gc.collect() by splitting my code into separate methods within a class. For example:

import gc
import pandas as pd

class A:
    def __init__(self):
        # your code
        pass

    def first_part_of_my_code(self):
        # your code: builds my_dataframe and derives my_new_light_dataframe from it
        # I want to clear my dataframe
        del my_dataframe
        gc.collect()
        my_dataframe = pd.DataFrame() # not sure whether this line really helps
        return my_new_light_dataframe

    def second_part_of_my_code(self):
        # my code
        # same principle
        pass

So when the program calls these methods, the dataframes created inside a method are local to it, and the garbage collector can free their memory once the program leaves the method.
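Applied to the question's loop over several files, the same idea might look like the sketch below. It is only an illustration under assumptions: the function names convert_file and convert_all and their parameters are hypothetical, and a plain function is used here instead of a class method since the scoping effect is the same. Each file is handled inside its own function call, so the DataFrame is a local variable and no reference to it survives the return.

```python
import csv
import gc
import os

import pandas as pd


def convert_file(in_path, out_path, colspecs, names, seq_id, load_date):
    """Convert one fixed-width file to a '|'-delimited file (hypothetical helper).

    The DataFrame is local to this function, so the last reference to it
    goes away when the function returns.
    """
    df = pd.read_fwf(in_path, colspecs=colspecs, names=names, dtype=object)
    # Assigning a scalar broadcasts it to every row, which avoids building
    # Python lists such as [fname] * len(df) next to the DataFrame.
    df.insert(0, "seq_id", seq_id)
    df.insert(1, "fname", os.path.basename(in_path))
    df["loadDatetime"] = load_date
    df.to_csv(out_path, sep="|", quoting=csv.QUOTE_NONE, escapechar="\\",
              index=False, header=False, encoding="utf-8")
    return len(df)


def convert_all(jobs, colspecs, names, load_date):
    """Process (input path, output path) pairs one after the other."""
    row_counts = []
    for seq_id, (in_path, out_path) in enumerate(jobs, start=1):
        row_counts.append(
            convert_file(in_path, out_path, colspecs, names, seq_id, load_date)
        )
        # Optional: in CPython the DataFrame is already released by reference
        # counting on return; gc.collect() only adds cleanup of reference cycles.
        gc.collect()
    return row_counts
```

A call such as convert_all([("infile.txt", "outfile.txt")], [(0, 10), (10, 20)], ["FIELD1", "FIELD2"], "04/10/2018") would process the sample file from the question. Note that even after the objects are freed, the operating system may not show the process shrinking right away, because the memory allocator can keep freed blocks for reuse.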
