My question looks like a classic one, but I cannot find the exact same question on Stack Overflow, so I hope mine is not a duplicate.

I have a large file with many rows and a fixed number of columns. Among all the columns, I am interested in columns A and B. The goal is to find rows where (1) the value in column A also appears in other rows, and (2) there is more than one row with the same column-A value but a different column-B value.

Consider the following table. I am interested in rows 1, 3, and 5 because "a" appears in three rows and the corresponding values in column B differ. In contrast, I am not interested in rows 2 and 4 because "b" appears twice but its corresponding value in column B is always "1". Similarly, I am not interested in row 6 because "c" appears only once.

# A B C D
=========
1 a 0 x x
2 b 1 x x
3 a 2 x x
4 b 1 x x
5 a 3 x x
6 c 1 x x

To find such rows, I read all lines in the file, convert each line into an object, build a list of these objects, and find the interesting rows with the following algorithm. The algorithm works, but it takes too long on my dataset. Do you have any suggestions for making it more efficient?

def getDuplicateList(oldlist):
    # find the A values that occur with more than one distinct B value;
    # comparing each b against the previously stored one is enough, since
    # any sequence with two distinct values has two adjacent differing ones
    duplicate = set()
    a_to_b = {}
    for elements in oldlist:
        a = elements.getA()
        b = elements.getB()
        if a in a_to_b:
            if b != a_to_b[a]:
                duplicate.add(a)
        a_to_b[a] = b

    # second pass: collect the A value of every matching row, in row order
    newlist = []
    for elements in oldlist:
        a = elements.getA()
        if a in duplicate:
            newlist.append(a)

    return newlist

P.S. I have added some constraints to clarify:

  1. I am using Python 2.7.
  2. I need all interesting rows: the duplicate set only contains each interesting "a" once, not an entry for every matching row.
  3. Order is important.
  4. In fact, the data is the memory accesses of a program execution. Column A holds the memory accesses, and Column B holds conditions that I am interested in. If a memory access occurs under several conditions at runtime, then I would like to investigate the sequence of those accesses.
– Sangmin

5 Answers


Does the original order need to be maintained? If not, it looks pretty similar to what groupby does, and you might get some performance boost out of using built-in methods.

Perhaps something like this (untested!):

import itertools

s = sorted(oldlist, key=lambda e: (e.getA(), e.getB()))
# groupby returns one-shot iterators, so materialize each group before use
groups = (list(g) for _, g in itertools.groupby(s, lambda e: e.getA()))
# keep only groups with more than one distinct B value
interesting = [g for g in groups if len(set(e.getB() for e in g)) > 1]
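
If the original order does matter (as the clarification in the question says it does), one untested way to restore it is to tag each element with its original index, group on the sorted copy, and sort the survivors back by index:

import itertools

def interesting_rows_in_order(oldlist):
    # pair each element with its original position so order can be restored
    indexed = sorted(enumerate(oldlist),
                     key=lambda p: (p[1].getA(), p[1].getB()))
    kept = []
    for _, group in itertools.groupby(indexed, key=lambda p: p[1].getA()):
        group = list(group)
        # keep the group only if it has more than one distinct B value
        if len(set(e.getB() for _, e in group)) > 1:
            kept.extend(group)
    kept.sort(key=lambda p: p[0])   # back to original file order
    return [e for _, e in kept]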
– Ken

Your complexity is already pretty good. You're just looking for a linear speedup here.

Is there a reason you can't just return duplicate instead of doing the second loop?

If you add an else you can avoid reinserting a_to_b[a] = b when it is already there.

Also, disk I/O is slow, and the CPU sits idle while waiting for a read. Since you have a lot of reads to do, you can probably get a significant speedup by having one thread find duplicates while another thread reads ahead in the file.
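
Here is a minimal, untested sketch of that reader/worker split, assuming Python 2.7 (hence the Queue module name); the handle_line callback is hypothetical and would wrap the body of your per-line processing:

import threading
import Queue  # renamed to queue in Python 3

def process_overlapped(filename, handle_line):
    q = Queue.Queue(maxsize=1000)    # bounded, so the reader cannot run far ahead

    def reader():
        with open(filename) as f:
            for line in f:
                q.put(line)
        q.put(None)                  # sentinel: end of input

    t = threading.Thread(target=reader)
    t.start()
    while True:
        line = q.get()
        if line is None:
            break
        handle_line(line)            # e.g. parse and feed getDuplicateList's loop
    t.join()

Note that the GIL means this only overlaps I/O waits with computation; it will not run two CPU-bound threads in parallel.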

– Karl Bielefeldt

The following is extremely simple. It yields the A values of the interesting rows; modifying it to yield the rows themselves would be straightforward:

def isInteresting(rows):
    avals = {}
    for row in rows:
        # collect the distinct B values seen for each A value
        bvals = avals.get(row.getA()) or set()
        bvals.add(row.getB())
        avals[row.getA()] = bvals

    return [ aval
             for aval in avals.keys()
             if len(avals[aval]) > 1 ]
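
Hypothetical usage, assuming row objects that expose getA() and getB() as in the question (the Row class below is only for illustration):

class Row(object):
    def __init__(self, a, b):
        self._a, self._b = a, b
    def getA(self):
        return self._a
    def getB(self):
        return self._b

rows = [Row('a', 0), Row('b', 1), Row('a', 2),
        Row('b', 1), Row('a', 3), Row('c', 1)]
print isInteresting(rows)   # ['a'] (dict key order is arbitrary)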
– Michael Lorton

Well, the two iterations over the elements in oldlist can be replaced by a single iteration. I believe this would, in most cases, improve the efficiency of your algorithm, particularly for long lists.

If the order of newlist is of no concern to you, I propose a single-loop replacement that has the same result as your algorithm. I have tested it against randomly generated million-element lists and it always runs in approximately half the time:

def new_getDuplicateList(oldlist):
    # single pass: emit matching rows as soon as a is known to be interesting
    newlist = []
    duplicate = set()
    a_to_b = {}   # maps a to the list of b values seen so far, one per row
    for elements in oldlist:   # elements are indexable here, e.g. (a, b) tuples
        a = elements[0]
        b = elements[1]
        if a in duplicate:
            newlist.append(a)
        elif a in a_to_b:   # test the dict directly; .keys() is an O(n) list in Python 2
            if b not in a_to_b[a]:
                # a second distinct b: release one 'a' per row seen so far
                a_to_b[a].append(b)
                duplicate.add(a)
                newlist.extend([a] * len(a_to_b[a]))
            else:
                a_to_b[a].append(b)   # same b again; record it for the count above
        else:
            a_to_b[a] = [b]

    return newlist

(The conditionals could probably be made prettier.) It would be very easy to modify it to output the entire rows instead of just the a values: just replace a with (a, b) wherever necessary. Also note that it consumes a bit more memory than the first algorithm, because the a_to_b dict now holds lists.
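
For what it's worth, an untested sketch of that (a, b) variant: a_to_b[a] already records the b value of every earlier occurrence, so the burst of rows released when a first becomes interesting can reuse it:

def new_getDuplicateList_pairs(oldlist):
    newlist = []
    duplicate = set()
    a_to_b = {}
    for a, b in oldlist:
        if a in duplicate:
            newlist.append((a, b))
        elif a in a_to_b:
            a_to_b[a].append(b)
            if len(set(a_to_b[a])) > 1:   # a second distinct b just arrived
                duplicate.add(a)
                newlist.extend((a, bi) for bi in a_to_b[a])
                # from now on 'a in duplicate' short-circuits the work above
        else:
            a_to_b[a] = [b]

    return newlist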


Creating objects out of the different items in your list is likely to cause some slowdown. Here I'm just using the collections module to create a multiset and letting the container itself sort out the irrelevant items. See how this works for you. I'm assuming the exact file format you gave above.

import collections

def get_interesting_items(filename):
    multiset = collections.defaultdict(set)

    with open(filename) as f:
        # skip the two header lines
        f.readline()
        f.readline()

        # add every B value to a set, indexed by A
        for line in f:
            _, a, b, _ = line.split(' ', 3)
            multiset[a].add(int(b))

    # generate all (A, Bset) pairs where Bset contains at least 2 items
    for a, bset in multiset.iteritems():
        if len(bset) >= 2:
            yield a, bset

def main():
    for a, bset in get_interesting_items('myfile.txt'):
        print a, bset

if __name__ == '__main__':
    main()
– Brandon
  • I see in your comments above that order is important to you. In that case, you can use an OrderedSet (http://stackoverflow.com/questions/1653970/does-python-have-an-ordered-set) instead of a normal set in the first line of get_interesting_items. – Brandon Mar 01 '11 at 15:45
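
An untested sketch of that order-preserving variant, using collections.OrderedDict (available in Python 2.7) as a stand-in for a third-party OrderedSet; the keys act as set members and remember first-seen order:

import collections

def get_interesting_items_ordered(filename):
    multiset = collections.defaultdict(collections.OrderedDict)

    with open(filename) as f:
        f.readline()   # skip the two header lines
        f.readline()
        for line in f:
            _, a, b, _ = line.split(' ', 3)
            multiset[a][int(b)] = None   # OrderedDict keys as an ordered set

    for a, bdict in multiset.iteritems():
        if len(bdict) >= 2:
            yield a, list(bdict)   # B values in first-seen order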