The easy way to solve a problem like this is by using a dict
, or a pair of them, or even a pair of collections.Counter
s:
positions, scores = Counter(), Counter()
for sequence, position, strand, score in reader:
positions[sequence] += position
scores[sequence] += scores
for sequence in positions:
writer.writerow(sequence, positions[sequence], scores[sequence])
The problem is that this may be too large to fit into memory.
I'd certainly try it and see first. A 1GB input file doesn't mean you need 1GB of memory. Remember, you're only keeping track of the sums for each distinct gene, not for each row in the input file. In your example, that means only two genes, out of eight values.
But if you can't fit it in memory, you need a database. Here, you really just need a simple key-value database that acts like a dict
, not something fancy. And Python has that built in, as dbm
.
For example (writing things overly verbose to make sure it's easy to understand):
import csv
import shelve
with dbm.open('p.db', 'w') as positions, dbm.open('s.db', 'w') as scores:
with open('input.tsv') as infile:
for row in csv.DictReader(infile, delimiter='\t'):
sequence = row['Sequence']
position = row['Position']
score = row['Score']
old_position = positions.get(sequence, '0')
positions[sequence] = str(int(old_position) + int(position))
old_score = scores.get(sequence, '0')
scores[sequence] = str(int(old_score) + int(score))
with open('output.tsv', 'w') as outfile:
writer = csv.writer(outfile, delimiter='\t')
for sequence in positions:
writer.writerow((sequence, positions[sequence], scores[sequence]))
If you need to do something more complicated, where a simple key-value database won't work, then you'll probably want to rewrite your logic in SQL and use the sqlite3
module to execute it for you.
And if your database is so huge that SQLite can't handle it, you can look at MySQL or another external database engine.