
I have a simple Python script that adds data from a column in one .tsv file to rows in another .tsv file. The files look like this:

  • data_table.tsv: 49KB, 179 rows
  • ref_table.tsv: 2.4GB, 1643334 rows

In order to correctly match entries from ref_table to the corresponding entries in data_table, I need to match on several keys. To do this, I have a simple script that looks like this:

#!/usr/bin/env python3
"""
Add the ref_table CScore value to matching rows in the data_table
Usage:
merge.py ref_table.tsv data_table.tsv
"""
import sys
import csv
args = sys.argv[1:]
ref_table = args[0]
data_table = args[1]
output_file = "output.tsv"

# make an index based on the ref table
index = {}
with open(ref_table) as fin:
    reader = csv.DictReader(fin, delimiter = '\t')
    for row in reader:
        key = (
            row['Hugo_Symbol'],
            row['Entrez_Gene_Id'],
            row['Chromosome'],
            row['Start_Position'],
            row['End_Position'],
            row['Tumor_Sample_Barcode'],
            row['Matched_Norm_Sample_Barcode']
        )
        index[key] = row['CScore']

print("index size: {} entries, {} MB".format( len(index), sys.getsizeof(index) / (1024 * 1024) ))
# index size: 1643184 entries, 80.00009155273438 MB

# load the rows of the data table
with open(data_table) as fin, open(output_file, "w") as fout:
    reader = csv.DictReader(fin, delimiter = '\t')
    fieldnames = [ f for f in reader.fieldnames]
    fieldnames.append('CScore')
    writer = csv.DictWriter(fout, fieldnames = fieldnames, delimiter = '\t')
    writer.writeheader()
    for row in reader:
        key = (
            row['Hugo_Symbol'],
            row['Entrez_Gene_Id'],
            row['Chromosome'],
            row['Start_Position'],
            row['End_Position'],
            row['Tumor_Sample_Barcode'],
            row['Matched_Norm_Sample_Barcode']
        )
        # add the matching value from the ref_table
        row['CScore'] = index[key]
        writer.writerow(row)

As you can see, all I am doing is building an index dict whose keys come from the matching columns and whose value is the CScore for that row. Then I just iterate over the lines in the data_table and look up the CScore value from index using the same keys. While running, I print out the size in MB of the index object, since that is the main data-carrying object in the script. I get a reported size like this:

index size: 1643184 entries, 80.00009155273438 MB

This seems pretty reasonable. Since everything else in this script is only iterating over rows (lines) in files, there shouldn't be any other major memory usage, right?

Well, when I profile the peak RSS memory usage of this script, I see values of >890 MB. This value seems accurate, since the script also crashes on systems where the allocated memory is set lower than this.
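
For reference, one way to check that figure from inside the script itself is the standard-library `resource` module (Unix only); this is just a rough sketch, not exactly how I profiled it, and note that `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS:

import resource

# peak resident set size of this process so far, as reported by the OS
# (on Linux, ru_maxrss is in KB)
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS: {:.2f} MB".format(peak_kb / 1024))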

Where is this excessive memory usage coming from? I have tried sprinkling `del` and `gc.collect()` calls liberally throughout the script, and it has had no effect except to greatly slow the script down.

Note that I am running this under Python 3.6, but have seen similar results with Python 3.7 too.

user5359531
    `sys.getsizeof(X)` only gives you the memory taken by *`X` itself*, not any other objects that are contained by, or otherwise referenced by, that object. – jasonharper Apr 19 '21 at 16:49
  • so you are suggesting that the contents of the `index` dict here are actually larger than the reported size of `index` itself? – user5359531 Apr 19 '21 at 16:51
  • Just adding to the top comment above. You may try a recursive recipe from https://stackoverflow.com/questions/449560/how-do-i-determine-the-size-of-an-object-in-python – Vytas Apr 19 '21 at 16:55
  • so, I tried changing the script's `key` value to be a str concatenation of the fields (`key = '.'.join([ row['Hugo_Symbol'], ... ])`); this did not change the size reported by `sys.getsizeof`. However, it actually did greatly reduce total memory usage, down to ~250MB. So maybe using the tuple of values as a key was inflating the memory. – user5359531 Apr 19 '21 at 16:59
  • If modifying your key makes such a significant difference in memory usage, perhaps you should look into making a hash of those fields for a key. Just be wary of possible duplicate hashes. – RufusVS Apr 19 '21 at 17:10
  • isn't "making a hash of the fields for a key" exactly what a Python dict is doing already? – user5359531 Apr 19 '21 at 17:57
  • @user5359531 You'd think, but the OP said changing from tuple to string changed the size significantly, which you wouldn't expect if the strings and tuples were all being hashed. – RufusVS Apr 19 '21 at 18:18
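
Following up on the recursive-size recipe linked in the comments above, here is a minimal sketch of what a deeper measurement could look like (the deep_getsizeof helper is just illustrative; it only descends into dicts, lists, tuples, and sets, so treat the result as approximate):

import sys

def deep_getsizeof(obj, seen=None):
    """Rough recursive size: the object itself plus everything it references,
    counting each object only once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

print("deep index size: {:.2f} MB".format(deep_getsizeof(index) / (1024 * 1024)))

Unlike sys.getsizeof(index), this counts every key tuple plus the seven strings inside it, as well as the CScore value strings, which should account for most of the gap between the reported 80 MB and the actual footprint.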

1 Answer


As per the comments, I tried changing this line

key = (
    row['Hugo_Symbol'],
    row['Entrez_Gene_Id'],
    row['Chromosome'],
    row['Start_Position'],
    row['End_Position'],
    row['Tumor_Sample_Barcode'],
    row['Matched_Norm_Sample_Barcode']
)

to this

key = '.'.join([
    row['Hugo_Symbol'],
    row['Entrez_Gene_Id'],
    row['Chromosome'],
    row['Start_Position'],
    row['End_Position'],
    row['Tumor_Sample_Barcode'],
    row['Matched_Norm_Sample_Barcode']
])

so that I am using a string value for the dict key instead of a tuple of strings, and this dropped the total memory usage from ~890 MB down to ~250 MB. Still much more than I would like, but a significant improvement.
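
To get a feel for why the key type matters so much, here is a rough sketch comparing the per-entry cost of the two key styles on a single made-up row (the field values are invented, and the byte counts will vary by CPython version; `sys.getsizeof` has to be applied to each component by hand since it is not recursive):

import sys

# made-up values roughly shaped like one row's matching columns
fields = ['TP53', '7157', '17', '7577120', '7577121', 'Tumor_A', 'Normal_A']

tuple_key = tuple(fields)
string_key = '.'.join(fields)

# the tuple key keeps the tuple object plus seven separate string objects alive
tuple_cost = sys.getsizeof(tuple_key) + sum(sys.getsizeof(f) for f in fields)
# the string key is a single object
string_cost = sys.getsizeof(string_key)

print("tuple key:  ~{} bytes per entry".format(tuple_cost))
print("string key: ~{} bytes per entry".format(string_cost))

Multiplied by ~1.6 million index entries, that per-entry difference adds up to several hundred MB, which is roughly the drop observed above. As suggested in the comments, hashing the joined key (for example with hashlib) could shrink it further, at the small risk of collisions.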

user5359531