
I have a list with approximately 177,071,007 items, and I'm trying to perform the following operations: a) get the first and last occurrence of each unique item in the list; b) get the number of occurrences of each item.

import csv

def parse_data(file, op_file_test):
    ins = csv.reader(open(file, 'rb'), delimiter='\t')
    pc = list()
    rd = list()
    deltas = list()
    reoccurance = list()
    try:
        for row in ins:
            pc.append(int(row[0]))
            rd.append(int(row[1]))
    except (ValueError, IndexError):
        print row
        pass

    unique_pc = set(pc)
    unique_pc = list(unique_pc)
    print "closing file"

    # takes a long time from here!
    for a in range(0, len(unique_pc)):
        index_first_occurance = pc.index(unique_pc[a])
        index_last_occurance = len(pc) - 1 - pc[::-1].index(unique_pc[a])
        delta_rd = rd[index_last_occurance] - rd[index_first_occurance]
        deltas.append(int(delta_rd))
        reoccurance.append(pc.count(unique_pc[a]))
        print unique_pc[a], delta_rd, reoccurance[a]

    print "printing to file"
    map_file = open(op_file_test, 'a')
    for a in range(0, len(unique_pc)):
        print >>map_file, "%d, %d, %d" % (unique_pc[a], deltas[a], reoccurance[a])
    map_file.close()

However, each pass through that loop is O(n): index(), count(), and the reversed-list lookup all scan the entire list, so the work is O(n) per unique item. Is there any way to make the for loop run faster? Would yielding make it fast, or is there some other way? Unfortunately, I don't have numpy.

pistal

4 Answers


Try the following:

import csv
from collections import defaultdict

# Keep a dictionary of our rd and pc values, with the value as a list of the line numbers each occurs on
# e.g. {'10': [1, 45, 79]}
pc_elements = defaultdict(list)
rd_elements = defaultdict(list)

with open(file, 'rb') as f:
    line_number = 0
    csvin = csv.reader(f, delimiter='\t')
    for row in csvin:
        try:
            pc_elements[int(row[0])].append(line_number)
            rd_elements[int(row[1])].append(line_number)
            line_number += 1
        except ValueError:
            print("Not a number")
            print(row)
            line_number += 1
            continue

for pc, indexes in pc_elements.iteritems():
    print("pc  {0} appears {1} times. First on row {2}, last on row {3}".format(
        pc,
        len(indexes),
        indexes[0],
        indexes[-1]
    ))

This works by building a dictionary while reading the TSV, with the pc value as the key and a list of the row numbers it occurs on as the value. Since dict keys are unique, we avoid the separate set, and the lists are only used to record which rows each key appears on.

Example:

pc_elements = {10: [4, 10, 18, 101], 8: [3, 12, 13]}

would output:

"pc 10 appears 4 times. First on row 4, last on row 101"
"pc 8 appears 3 times. First on row 3, last on row 13"
Ewan
  • how does continue differ from pass ? :) – pistal Feb 25 '14 at 11:18
  • @pistal - `continue` moves onto the next iteration of a loop (not running any code after it) while `pass` will run any code after it before the next iteration. A good answer with examples [here.](http://stackoverflow.com/a/9484008/1401034) – Ewan Feb 25 '14 at 11:21
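
A minimal sketch of the difference described in that comment:

for i in range(3):
    if i == 1:
        continue   # skips the rest of the loop body for i == 1
    print(i)       # prints 0 and 2

for i in range(3):
    if i == 1:
        pass       # a no-op; the code below still runs
    print(i)       # prints 0, 1 and 2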

Try replacing the lists with dicts; looking a value up in a dict is much faster than searching a long list.

That could be something like this:

import csv

def parse_data(file, op_file_test):
    ins = csv.reader(open(file, 'rb'), delimiter='\t')

    # Dict of pc -> [rd first occurrence, rd last occurrence, row indices of occurrences...]
    occurences = {}

    for i, row in enumerate(ins):
        try:
            pc = int(row[0])
            rd = int(row[1])
        except (ValueError, IndexError):
            print row
            continue

        if pc not in occurences:
            occurences[pc] = [rd, rd, i]
        else:
            occurences[pc][1] = rd
            occurences[pc].append(i)

    # (Remove the sorted() if you don't need the keys sorted and want it faster)
    for value in sorted(occurences.keys()):
        print "value: %d, delta: %d, occurences: %s" % (
            value, occurences[value][1] - occurences[value][0],
            ", ".join(str(idx) for idx in occurences[value][2:]))
Nicolas Defranoux

As you scan items from your input file, put them into a collections.defaultdict(list) where the key is the item and the value is a list of occurrence indices. It takes linear time to read the file and build up this data structure, and then constant time to get the first and last occurrence index of an item, and constant time to get the number of occurrences of an item.

Here's how it might work

import collections

mydict = collections.defaultdict(list)
# itemfilereader yields (item, line_number) pairs, e.g. your csv reader
# combined with enumerate so each item comes with its line number
for item, index in itemfilereader: # O(n)
    mydict[item].append(index)

# first occurrence of item, O(1)
mydict[item][0]

# last occurrence of item, O(1)
mydict[item][-1]

# number of occurrences of item, O(1)
len(mydict[item])
IceArdor
  • itemfilereader is your csvreader plus something that will give you the index of an item occurrence (line number). `enumerate` comes to mind. – IceArdor Feb 25 '14 at 11:13
  • Even with a O(1) data structure, reading in and allocating enough memory for 177 million items will be slow. Unless many of your items occur much more than once, you may want to skip any item you know you won't need while reading in the file. You can avoid reading in the entire file if you only need to know the first and last occurrence of an item. This is what the head and tail command line programs in Linux do. – IceArdor Feb 25 '14 at 11:23
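  • A minimal sketch of how itemfilereader could be wired up with csv.reader and enumerate, as the comment above suggests (treating row[0] as the pc column, based on the question's format, and `file` as the question's filename variable):

    import csv
    import collections

    def itemfilereader(path):
        # yield (item, line_number) pairs from a tab-separated file
        with open(path, 'rb') as f:
            for index, row in enumerate(csv.reader(f, delimiter='\t')):
                try:
                    yield int(row[0]), index   # row[0] is the pc column
                except (ValueError, IndexError):
                    continue                   # skip malformed rows

    mydict = collections.defaultdict(list)
    for item, index in itemfilereader(file):
        mydict[item].append(index)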

Maybe it's worth changing the data structure used. I'd use a dict that uses pc as the key and the occurrences as the values.

lookup = {}
counter = 0
for line in ins:   # ins is the csv.reader from the question
    values = lookup.setdefault(int(line[0]), [])
    values.append((counter, int(line[1])))
    counter += 1

for key, val in lookup.iteritems():
    value_of_first_occurence = val[0][1]
    value_of_last_occurence = val[-1][1]
    first_occurence = val[0][0]
    last_occurence = val[-1][0]
    number_of_occurences = len(val)
    delta_rd = value_of_last_occurence - value_of_first_occurence
Denis
  • Sorry, posted the wrong code. Now it's updated. It reads every line and stores the 'pc' as the key in a dict, together with a list that contains the value as returned by the csv reader (the 'rd') plus the row indices of the occurrences. – Denis Feb 25 '14 at 11:08
  • I apologize once more. You're interested in the delta rd, not in the delta position. I updated that. – Denis Feb 25 '14 at 11:17
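  • For a few rows, the lookup structure from this answer would look something like the following (illustrative values):

    # rows read, as (pc, rd) pairs: (10, 5), (8, 2), (10, 9), (10, 20)
    lookup = {10: [(0, 5), (2, 9), (3, 20)],  # (row index, rd) for each occurrence
              8:  [(1, 2)]}
    # for pc 10: first occurrence at row 0 with rd 5, last at row 3 with rd 20,
    # delta rd = 20 - 5 = 15, and it occurs len(lookup[10]) == 3 times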