I have a simple Python script that adds data from a column in one .tsv file to rows in another .tsv file. The files look like this:

data_table.tsv: 49KB, 179 rows
ref_table.tsv: 2.4GB, 1643334 rows
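For context, both files share the key columns used by the script below; an illustrative ref_table header and row (the values here are made up) would look like:

Hugo_Symbol	Entrez_Gene_Id	Chromosome	Start_Position	End_Position	Tumor_Sample_Barcode	Matched_Norm_Sample_Barcode	CScore
TP53	7157	17	7577120	7577121	Sample_01	Normal_01	0.92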
In order to correctly match entries from ref_table to the corresponding entries in data_table, I need to match on several keys. To do this, I have a simple script that looks like this:
#!/usr/bin/env python3
"""
Add the ref_table CScore value to matching rows in the data_table
Usage:
merge.py ref_table.tsv data_table.tsv
"""
import sys
import csv
args = sys.argv[1:]
ref_table = args[0]
data_table = args[1]
output_file = "output.tsv"
# make an index based on the ref table
index = {}
with open(ref_table) as fin:
    reader = csv.DictReader(fin, delimiter='\t')
    for row in reader:
        key = (
            row['Hugo_Symbol'],
            row['Entrez_Gene_Id'],
            row['Chromosome'],
            row['Start_Position'],
            row['End_Position'],
            row['Tumor_Sample_Barcode'],
            row['Matched_Norm_Sample_Barcode'],
        )
        index[key] = row['CScore']

print("index size: {} entries, {} MB".format(len(index), sys.getsizeof(index) / (1024 * 1024)))
# index size: 1643184 entries, 80.00009155273438 MB
# load the rows of the data table
with open(data_table) as fin, open(output_file, "w") as fout:
    reader = csv.DictReader(fin, delimiter='\t')
    fieldnames = list(reader.fieldnames)
    fieldnames.append('CScore')
    writer = csv.DictWriter(fout, fieldnames=fieldnames, delimiter='\t')
    writer.writeheader()
    for row in reader:
        key = (
            row['Hugo_Symbol'],
            row['Entrez_Gene_Id'],
            row['Chromosome'],
            row['Start_Position'],
            row['End_Position'],
            row['Tumor_Sample_Barcode'],
            row['Matched_Norm_Sample_Barcode'],
        )
        # add the matching value from the ref_table
        row['CScore'] = index[key]
        writer.writerow(row)
As you can see here, all I am doing is making an index dict with keys based on the columns used for matching, and a single value based on the CScore for that row. Then, I just iterate over the lines in the data_table and look up the CScore value from index based on the same keys. While running, I print out the size in MB of the object index, since that is the main data-carrying object in the script. I get a reported size like this:
index size: 1643184 entries, 80.00009155273438 MB
This seems pretty reasonable. Since everything else in this script is only iterating over rows (lines) in files, there shouldn't be any other major memory usage, right?
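One caveat: as I understand it, sys.getsizeof on a dict only counts the dict's own hash table, not the key tuples and value strings it references. A rough deep estimate that also counts those (ignoring any string sharing, so it may overcount) would be something like this sketch, where deep_index_size_mb is just an illustrative helper:

import sys

def deep_index_size_mb(index):
    # dict hash table, plus every key tuple, every string inside a key, and every value string
    total = sys.getsizeof(index)
    for key, value in index.items():
        total += sys.getsizeof(key)                   # the key tuple itself
        total += sum(sys.getsizeof(s) for s in key)   # the strings inside the key
        total += sys.getsizeof(value)                 # the CScore string
    return total / (1024 * 1024)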
Well, when I profile the peak RSS memory usage of this script, I see values of >890MB. This value seems accurate, since the script also crashes on systems where the memory allocated to it is lower than this.
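In case the measurement method matters: one way to spot-check peak RSS from inside a Python script is the stdlib resource module (note that ru_maxrss is reported in kilobytes on Linux but bytes on macOS):

import resource

# peak resident set size of this process so far; kilobytes on Linux, bytes on macOS
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS: {:.2f} MB".format(peak_kb / 1024))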
Where is this excessive memory usage coming from? I have tried sprinkling liberal use of del and gc.collect() into various locations in the script, and it has had no effect except to greatly slow the script down.
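For example, the kind of thing I tried looks roughly like this (an illustrative toy, not the actual script):

import gc

index = {("key", str(i)): "value" for i in range(1000)}
# ... do work with index ...
del index     # drop the only reference to the dict
gc.collect()  # force a full collection pass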
Note that I am running this under Python 3.6, but have seen similar results with Python 3.7 too.