0

A large CSV I was given contains a table of flight data. A function I wrote to help parse it iterates over the column of Flight IDs and returns a dictionary containing the index and value of every unique Flight ID, in order of first appearance.

Dictionary = { Index: FID, ... }

This is a quick adjustment to an older function that didn't need to worry about repeated FIDs in the column (a few hundred thousand rows later...).

Right now, it iterates over the column in order, comparing each value to the one after it. If a value is equal to the next value, it skips it; if the next value is different, it stores that value in the dictionary. I changed it to also check whether that value has already occurred before, and if so, to skip it.
Here's my code:

    def DiscoverEarliestIndex(self, number):
        finaldata = {}
        columnvalues = self.column(number)
        columnenum = {}
        for a, b in enumerate(columnvalues):
            columnenum[a] = b
        i = 0
        while i < (len(columnvalues) - 1):
            next = columnenum[i + 1]
            if columnvalues[i] == next:
                i += 1
            else:
                if next in finaldata.values():
                    i += 1
                    continue
                else:
                    finaldata[i + 1] = next
                    i += 1
        else:
            return finaldata

It's very inefficient, and slows down as the dictionary grows. The column has 5.2 million rows, so it's obviously not a good idea to handle this much with Python, but I'm stuck with it for now.

Is there a more efficient way to write this function?

Adam Barthelson
  • 1,088
  • 1
  • 11
  • 20
  • Aside: I'm not sure how well your nomenclature decisions -- both here and in a previous question -- will serve you professionally, although obviously your mileage may vary. It made me decide not to spend time pointing out certain problems, anyhow: life's too short. (Hint: what does `.values()` return? Why is membership testing with it a bad idea?) – DSM Mar 19 '13 at 21:07
  • 1
    Didn't realize my pastor was on Stackoverflow, I will keep that in mind next time. – Adam Barthelson Mar 19 '13 at 21:32

3 Answers

1

You are essentially looking for a database. Databases are made exactly for such operations on large datasets. It will be much faster to parse the entire CSV at once using the csv module and load it into a database than to store it in a dict and run checks against the entire dict.

*large* python dictionary with persistence storage for quick look-ups
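As a rough sketch of that approach, Python's built-in sqlite3 module can do the first-appearance lookup with a single GROUP BY. The table name, column names, and sample values below are made up for illustration; in practice you would stream the real file in with csv.reader and executemany:

```python
import sqlite3

# Sample FID column standing in for the real CSV data.
fids = ["AA1", "AA1", "BA7", "BA7", "AA1", "UA3"]

# Load the column into an in-memory SQLite table, keeping the row index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (idx INTEGER, fid TEXT)")
conn.executemany("INSERT INTO flights VALUES (?, ?)", enumerate(fids))

# Earliest index of each unique FID, in order of first appearance.
rows = conn.execute(
    "SELECT MIN(idx), fid FROM flights GROUP BY fid ORDER BY MIN(idx)"
).fetchall()
first_seen = dict(rows)
print(first_seen)  # {0: 'AA1', 2: 'BA7', 5: 'UA3'}
```

The database does the dedup work in C, so the Python loop over 5.2 million rows disappears entirely.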

nicbou
  • 1,047
  • 11
  • 16
1

To answer your question directly, you should be able to do this with dict comprehensions and the itertools module.

>>> import itertools as it
>>> data = {1: 'a', 2: 'a', 3: 'c', 4: 'c', 5: 'd'}
>>> grouped_shit = {k: list(v) for (k, v) in it.groupby(data.items(), lambda kv: kv[1])}
>>> good_shit = {v[0][0]: k for (k, v) in grouped_shit.items()}
>>> good_shit
{1: 'a', 3: 'c', 5: 'd'}

I think this can be tweaked a bit--I'm not super happy about going over the dict twice. But anyway, the dict comprehensions are pretty efficient. Also, groupby assumes that your keys are in order---that is, it assumes that all the 'a' indices are grouped together, which seems to be true in your case.
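A single-pass alternative would track already-seen values in a set, so the membership test is O(1) instead of a scan (the function and variable names here are illustrative, and indices are 0-based from enumerate):

```python
# Single-pass sketch: record the first index of each value; the set makes
# the "seen before?" check O(1) rather than a linear search.
def earliest_indices(values):
    first_seen = {}
    seen = set()
    for i, v in enumerate(values):
        if v not in seen:
            seen.add(v)
            first_seen[i] = v
    return first_seen

print(earliest_indices(['a', 'a', 'c', 'c', 'd']))  # {0: 'a', 2: 'c', 4: 'd'}
```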

BenDundee
  • 4,389
  • 3
  • 28
  • 34
  • There's a small example of what the data looks like here in a previous question: [link](http://stackoverflow.com/questions/15148983/looking-for-a-more-efficient-way-to-reorganize-a-massive-csv-in-python) Note, all spaces have data in them in this case. – Adam Barthelson Mar 19 '13 at 21:40
1
    if next in finaldata.values():

is probably your problem. What you are doing here is:

  1. creating a list
  2. searching the list

maybe you can use a set to hold the values and search that - something like this:

    searchable_data = set()
    while i < (len(columnvalues) - 1):
        next = columnenum[i + 1]
        if columnvalues[i] == next:
            i += 1
        else:
            if next in searchable_data:
                i += 1
                continue
            else:
                finaldata[i + 1] = next
                searchable_data.add(next)
                i += 1
    else:
        return finaldata
Brad
  • 1,367
  • 1
  • 8
  • 17