0

A large CSV I was given contains a table of flight data. A function I wrote to help parse it iterates over the column of Flight IDs and returns a dictionary containing the index and value of every unique Flight ID, in order of first appearance.

Dictionary = { Index: FID, ... }

This is a quick adjustment to an older function that didn't need to worry about repeated FIDs in the column (a few hundred thousand rows later...).

Right now, it iterates over the column in order, comparing each value to the one after it. If a value is equal to the next value, it skips it; if the next value is different, it stores that value in the dictionary. I changed it to also check whether that value has already occurred before, and if so, to skip it.
Here's my code:

    def DiscoverEarliestIndex(self, number):
        finaldata = {}
        columnvalues = self.column(number)
        columnenum = {}
        for a, b in enumerate(columnvalues):
            columnenum[a] = b
        i = 0
        while i < (len(columnvalues) - 1):
            next = columnenum[i + 1]
            if columnvalues[i] == next:
                i += 1
            else:
                if next in finaldata.values():
                    i += 1
                    continue
                else:
                    finaldata[i + 1] = next
                    i += 1
        else:
            return finaldata

It's very inefficient, and slows down as the dictionary grows. The column has 5.2 million rows, so it's obviously not a good idea to handle this much with Python, but I'm stuck with it for now.

Is there a more efficient way to write this function?

Adam Barthelson
  • 1,088
  • 1
  • 11
  • 20
  • Aside: I'm not sure how well your nomenclature decisions -- both here and in a previous question -- will serve you professionally, although obviously your mileage may vary. It made me decide not to spend time pointing out certain problems, anyhow: life's too short. (Hint: what does `.values()` return? Why is membership testing with it a bad idea?) – DSM Mar 19 '13 at 21:07
  • 1
    Didn't realize my pastor was on Stackoverflow, I will keep that in mind next time. – Adam Barthelson Mar 19 '13 at 21:32

3 Answers

1

You are essentially looking for a database. Databases are made exactly for such operations on large datasets. It will be much faster to parse the entire CSV at once using the csv module and load it into a database than to store it in a dict and run checks against the entire dict.

*large* python dictionary with persistence storage for quick look-ups
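As a rough sketch of that approach, Python's built-in sqlite3 module can do the first-appearance lookup with a single GROUP BY. The table name, column names, and sample values below are made up for illustration; in practice you would stream the real file in with csv.reader and executemany:

```python
import sqlite3

# Sample FID column standing in for the real CSV data.
fids = ["AA1", "AA1", "BA7", "BA7", "AA1", "UA3"]

# Load the column into an in-memory SQLite table, keeping the row index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (idx INTEGER, fid TEXT)")
conn.executemany("INSERT INTO flights VALUES (?, ?)", enumerate(fids))

# Earliest index of each unique FID, in order of first appearance.
rows = conn.execute(
    "SELECT MIN(idx), fid FROM flights GROUP BY fid ORDER BY MIN(idx)"
).fetchall()
first_seen = dict(rows)
print(first_seen)  # {0: 'AA1', 2: 'BA7', 5: 'UA3'}
```

The database does the dedup work in C, so the Python loop over 5.2 million rows disappears entirely.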

nicbou
  • 1,047
  • 11
  • 16
1

To answer your question directly, you should be able to do this with dict comprehensions and the itertools module.

>>> import itertools as it
>>> data = {1: 'a', 2: 'a', 3: 'c', 4: 'c', 5: 'd'}
>>> grouped_shit = {k: list(v) for (k, v) in it.groupby(data.items(), lambda kv: kv[1])}
>>> good_shit = {v[0][0]: k for (k, v) in grouped_shit.items()}
>>> good_shit
{1: 'a', 3: 'c', 5: 'd'}

I think this can be tweaked a bit--I'm not super happy about going over the dict twice. But anyway, the dict comprehensions are pretty efficient. Also, groupby assumes that your keys are in order---that is, it assumes that all the 'a' indices are grouped together, which seems to be true in your case.
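A single-pass alternative would track already-seen values in a set, so the membership test is O(1) instead of a scan (the function and variable names here are illustrative, and indices are 0-based from enumerate):

```python
# Single-pass sketch: record the first index of each value; the set makes
# the "seen before?" check O(1) rather than a linear search.
def earliest_indices(values):
    first_seen = {}
    seen = set()
    for i, v in enumerate(values):
        if v not in seen:
            seen.add(v)
            first_seen[i] = v
    return first_seen

print(earliest_indices(['a', 'a', 'c', 'c', 'd']))  # {0: 'a', 2: 'c', 4: 'd'}
```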

BenDundee
  • 4,389
  • 3
  • 28
  • 34
  • There's a small example of what the data looks like here in a previous question: [link](http://stackoverflow.com/questions/15148983/looking-for-a-more-efficient-way-to-reorganize-a-massive-csv-in-python) Note, all spaces have data in them in this case. – Adam Barthelson Mar 19 '13 at 21:40
1
    if next in finaldata.values():

is probably your problem. What you are doing here is:

  1. creating a list
  2. searching the list

maybe you can use a set to hold the values and search that - something like this:

    searchable_data = set()
    while i < (len(columnvalues) - 1):
        next = columnenum[i + 1]
        if columnvalues[i] == next:
            i += 1
        else:
            if next in searchable_data:
                i += 1
                continue
            else:
                finaldata[i + 1] = next
                searchable_data.add(next)
                i += 1
    else:
        return finaldata
Brad
  • 1,367
  • 1
  • 8
  • 17