
The following program has been running for about 22 hours on two text files (~10 MB each, roughly 100K rows per file). Can someone give me an indication of how inefficient my code is and perhaps suggest a faster method? The input dicts are ordered and preserving that order is necessary:

import collections

def uniq(input):
  output = []
  for x in input:
    if x not in output:
      output.append(x)
  return output

Su = {}
with open ('Sucrose_rivacombined.txt') as f:
    for line in f:
        (key, val) = line.split('\t')
        Su[(key)] = val
    Su_OD = collections.OrderedDict(Su)

Su_keys = Su_OD.keys()
Et = {}

with open ('Ethanol_rivacombined.txt') as g:
    for line in g:
        (key, val) = line.split('\t')
        Et[(key)] = val
    Et_OD = collections.OrderedDict(Et)

Et_keys = Et_OD.keys()

merged_keys = Su_keys + Et_keys
merged_keys =  uniq(merged_keys)

d3=collections.OrderedDict()

output_doc = open("compare.txt","w+")

for chr_local in merged_keys:
    line_output = chr_local
    if (Et.has_key(chr_local)):
        line_output = line_output + "\t" + Et[chr_local]
    else:
        line_output = line_output + "\t" + "ND"
    if (Su.has_key(chr_local)):
        line_output = line_output + "\t" + Su[chr_local]
    else:
        line_output = line_output + "\t" + "ND"

    output_doc.write(line_output + "\n")

The input files are as follows (not every key is present in both files):

Su:
chr1:3266359    80.64516129
chr1:3409983    100
chr1:3837894    75.70093458
chr1:3967565    100
chr1:3977957    100


Et:
chr1:3266359    95
chr1:3456683    78
chr1:3837894    54.93395855
chr1:3967565    100
chr1:3976722    23

I would like the output to look as follows:

chr1:3266359    80.645    95
chr1:3456683    ND        78
jobrant
    Why not profile it on smaller inputs and see for yourself where the time is being spent? – NPE May 02 '12 at 16:04
  • I am not sure how to do that. I ran it earlier on half-sized files and it took only about 3 hours. The CPU usage is only about 25% and RAM is only about 1.6 GB with ~6 GB to spare, so it's not like it's stressing resources. I just wonder if I have mis-coded something, causing it to keep reading through the files unnecessarily. – jobrant May 02 '12 at 16:12
  • Have you verified this does what you want? Because `Su` is a normal dict, you've already lost the filesystem order when you convert it to `Su_OD`. You probably want to create the ordered dict up front? – Francis Avila May 02 '12 at 16:12
  • 1
    Is there some missing indentation below the 'for' portion to ensure the if-elses go through the loops? Rather than repeatedly concatenating strings, you might consider just directly writing instead of constructing line_output each time. Oh, and you might also consider the CSV module for your TSV output. – hexparrot May 02 '12 at 16:13
  • @jobrant: I mean using an automated tool that would tell you where the time is being spent. See http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script (for best results, you might need to refactor your code a bit.) – NPE May 02 '12 at 16:13
  • @hexparrot. sorry about that, I have edited doc to note. Thanks – jobrant May 02 '12 at 16:22
  • @jobrant able to give us 2 or 3 lines from each of the files so we can best recommend improvements? – hexparrot May 02 '12 at 16:31
  • @hexparrot, I have edited doc to show input format – jobrant May 02 '12 at 16:43
  • I should also add: the program is done now, however the format is wrong. When ND is written the format is correct, but when two values are present, the second value ends up on a new line. When I ran it on test data, the input was split on () and produced the correct format; this time it is split on ('\t') and the format is wrong? – jobrant May 02 '12 at 16:45

3 Answers


Replace uniq with this as the inputs are hashable:

def uniq(input):
  output = []
  s = set()              # items already seen; set membership tests are O(1)
  for x in input:
    if x not in s:
      output.append(x)
      s.add(x)
  return output

This reduces a nearly O(n^2) process to nearly O(n): the `x not in output` test scans the whole list every time, whereas a set membership test is O(1).
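A quick sanity check (hypothetical input, just to show that duplicates are dropped while first-seen order is kept):

keys = ['chr1:3266359', 'chr1:3409983', 'chr1:3266359', 'chr1:3967565']
print uniq(keys)   # ['chr1:3266359', 'chr1:3409983', 'chr1:3967565']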

Dan D.

You don't need your uniq function at all.

Pseudocode would be something like:

  1. read file 2 as an OrderedDict
  2. process file 1, writing out its items (they are already ordered correctly)
  3. pop, with a default, from file 2 for the last part of each output line
  4. after file 1 is consumed, process the OrderedDict left over from file 2

Also, love generator expressions... you can read the file with:

OrderedDict(line.strip().split('\t') for line in open('Ethanol_rivacombined.txt'))

Only one ordered dict is built, and 'Sucrose_rivacombined.txt' never even makes it into memory as a whole. It should be super fast.

EDIT: complete code (not sure about your output line format):

from collections import OrderedDict

# read the Ethanol file into an OrderedDict mapping key -> value
Et_OD = OrderedDict(line.strip().split('\t') for line in open('Ethanol_rivacombined.txt'))

with open("compare.txt", "w+") as output_doc:
    for line in open('Sucrose_rivacombined.txt'):
        key, val = line.strip().split('\t')
        # pop removes the key from Et_OD; 'ND' is the default when the key is missing
        line_out = '\t'.join((key, val, Et_OD.pop(key, 'ND')))
        output_doc.write(line_out + '\n')

    # whatever is left in Et_OD has no Sucrose value
    for key, val in Et_OD.items():
        line_out = '\t'.join((key, 'ND', val))
        output_doc.write(line_out + '\n')
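With the five-line sample files from the question, this should produce something like the following (Sucrose value in the second column, Ethanol in the third, matching the desired output):

    chr1:3266359    80.64516129    95
    chr1:3409983    100    ND
    chr1:3837894    75.70093458    54.93395855
    chr1:3967565    100    100
    chr1:3977957    100    ND
    chr1:3456683    ND    78
    chr1:3976722    ND    23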
Phil Cooper
  • Thank you. This is nice and compact and works extremely fast with data sets of over 500k rows of values. – jobrant May 02 '12 at 18:46

Your output is a list, but your inputs are dictionaries: their keys are guaranteed unique, yet your `not in output` check has to compare against every element of the list, which makes the whole thing quadratic. (You're doing roughly n^2 comparisons because of that check.)

You could probably completely replace uniq with:

Su_OD.update(Et_OD)

This works for me:

from collections import OrderedDict

one = OrderedDict()
two = OrderedDict()

one['hello'] = 'one'
one['world'] = 'one'

two['world'] = 'two'
two['cats'] = 'two'

one.update(two)

for k, v in one.iteritems():
    print k, v

output:

    hello one
    world two
    cats two
quodlibetor
  • Using `Su_OD.update(Et_OD)` would not work, as the same key may be present in both `Su_OD` and `Et_OD`, and that expression would overwrite the value from `Su_OD` with the value from `Et_OD`, thus losing a data value the OP wants to keep. – Dan D. May 02 '12 at 19:29
  • Oh, I misunderstood. I thought the OP wanted just one of the data values (actually I thought they were the same in both cases). Yeah, this is wrong. – quodlibetor May 02 '12 at 19:54