2

I'm wondering what the best way to parse long form data into wide for is in python. I've previously been doing this sort of task in R but it really is taking to long as my files can be upwards of 1 gb. Here is some dummy data:

Sequence Position Strand Score
Gene1    0        +      1
Gene1    1        +      0.25
Gene1    0        -      1
Gene1    1        -      0.5
Gene2    0        +      0
Gene2    1        +      0.1
Gene2    0        -      0
Gene2    1        -      0.5

But I'd like to have it in the wide form where I've summed the scores over the strands at each position. Here is output I hope for:

Sequence 0 1
Gene1    2 0.75
Gene2    0 0.6

Any help on how to attack such a problem conceptually would be really helpful.

UnivStudent
  • 402
  • 1
  • 3
  • 11
  • 2
    I'm sure python would be a good tool for the job but you have neglected to explain what you need to do. – Graeme Stuart Sep 23 '13 at 21:17
  • I suspect you're going to want to look at [pandas](http://pandas.pydata.org), but a concrete example would make it easier to help. (Whether it would be faster or not is an open question.) – DSM Sep 23 '13 at 21:17
  • 1
    are you talking about parsing fixed width csv ? "long form" and "wide form" are not well defined terms . are you looking to transpose a matrix ? – vish Sep 23 '13 at 21:20
  • Sorry All, I pressed submit too early and the question went up incomplete. I've posted an example now. – UnivStudent Sep 23 '13 at 21:28

2 Answers2

7

Both of these solutions seem like overkill when you can do it with pandas in a one-liner:

In [7]: df
Out[7]: 
  Sequence  Position Strand  Score
0    Gene1         0      +   1.00
1    Gene1         1      +   0.25
2    Gene1         0      -   1.00
3    Gene1         1      -   0.50
4    Gene2         0      +   0.00
5    Gene2         1      +   0.10
6    Gene2         0      -   0.00
7    Gene2         1      -   0.50

In [8]: df.groupby(['Sequence', 'Position']).Score.sum().unstack('Position')
Out[8]: 
Position  0     1
Sequence         
Gene1     2  0.75
Gene2     0  0.60

If you cannot load the file into memory then an out-of-core solution in the other answers will work too.

Wes McKinney
  • 101,437
  • 32
  • 142
  • 108
0

The easy way to solve a problem like this is by using a dict, or a pair of them, or even a pair of collections.Counters:

positions, scores = Counter(), Counter()
for sequence, position, strand, score in reader:
    positions[sequence] += position
    scores[sequence] += scores

for sequence in positions:
    writer.writerow(sequence, positions[sequence], scores[sequence])

The problem is that this may be too large to fit into memory.

I'd certainly try it and see first. A 1GB input file doesn't mean you need 1GB of memory. Remember, you're only keeping track of the sums for each distinct gene, not for each row in the input file. In your example, that means only two genes, out of eight values.


But if you can't fit it in memory, you need a database. Here, you really just need a simple key-value database that acts like a dict, not something fancy. And Python has that built in, as dbm.

For example (writing things overly verbose to make sure it's easy to understand):

import csv
import shelve

with dbm.open('p.db', 'w') as positions, dbm.open('s.db', 'w') as scores:
    with open('input.tsv') as infile:
        for row in csv.DictReader(infile, delimiter='\t'):
            sequence = row['Sequence']
            position = row['Position']
            score = row['Score']
            old_position = positions.get(sequence, '0')
            positions[sequence] = str(int(old_position) + int(position))
            old_score = scores.get(sequence, '0')
            scores[sequence] = str(int(old_score) + int(score))
   with open('output.tsv', 'w') as outfile:
       writer = csv.writer(outfile, delimiter='\t')
       for sequence in positions:
           writer.writerow((sequence, positions[sequence], scores[sequence]))

If you need to do something more complicated, where a simple key-value database won't work, then you'll probably want to rewrite your logic in SQL and use the sqlite3 module to execute it for you.

And if your database is so huge that SQLite can't handle it, you can look at MySQL or another external database engine.

abarnert
  • 354,177
  • 51
  • 601
  • 671