My question looks like a classic one, but I cannot find the exact same question on Stack Overflow. I hope mine is not a duplicate.
I have a large file with many rows and a fixed set of columns. Of all the columns, I am interested only in columns A and B. I would like to get every row where (1) the value of Column A in that row also appears in other rows, and (2) at least one of those other rows has the same value of Column A but a different value of Column B.
Consider the following table. I am interested in rows 1,3, and 5 because "a" appears in 3 rows, and the values in Column B are different. In contrast, I am not interested in rows 2 and 4 because "b" appears twice, but its corresponding value in Column B is always "1". Similarly, I am not interested in row 6 because "c" appears only once.
    #  A  B  C  D
    =============
    1  a  0  x  x
    2  b  1  x  x
    3  a  2  x  x
    4  b  1  x  x
    5  a  3  x  x
    6  c  1  x  x
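To make the criterion concrete, here is a minimal sketch on the sample table. It assumes each row is a plain (row number, A, B) tuple instead of an object, and the names `rows`, `b_values`, and `interesting` are only for illustration. It pins down the expected output and is not meant as an efficient solution.

```python
from collections import defaultdict

# Sample table as (row_number, A, B) tuples; columns C and D are omitted.
# The tuple layout is an assumption made for this sketch.
rows = [
    (1, "a", 0),
    (2, "b", 1),
    (3, "a", 2),
    (4, "b", 1),
    (5, "a", 3),
    (6, "c", 1),
]

# Pass 1: collect the distinct B values seen for each A value.
b_values = defaultdict(set)
for _, a, b in rows:
    b_values[a].add(b)

# Pass 2: keep rows whose A value has more than one distinct B, in file order.
interesting = [row for row in rows if len(b_values[row[1]]) > 1]

print([row[0] for row in interesting])  # [1, 3, 5]
```

Rows 2 and 4 are dropped because "b" only ever maps to 1, and row 6 is dropped because "c" appears once.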
To find such rows, I read all lines in the file, convert each line into an object, build a list of those objects, and find the interesting rows with the following algorithm. The algorithm works, but it is slow on my dataset. Do you have any suggestions to make it more efficient?
    def getDuplicateList(oldlist):
        # first pass: find A values that occur with more than one B value
        duplicate = set()
        a_to_b = {}  # last B value seen for each A value
        for elements in oldlist:
            a = elements.getA()
            b = elements.getB()
            if a in a_to_b:
                if b != a_to_b[a]:
                    duplicate.add(a)
            a_to_b[a] = b
        # second pass: collect the matching rows in their original order
        newlist = []
        for elements in oldlist:
            a = elements.getA()
            if a in duplicate:
                # keep the whole row, not just its A value
                newlist.append(elements)
        return newlist
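For reproducibility, the reading step that produces `oldlist` might look like the sketch below. Only `getA()` and `getB()` come from the code above; the `Row` class, its constructor, the `read_rows` helper, and the whitespace-separated line layout are all assumptions made for illustration, and an in-memory buffer stands in for the real file.

```python
import io


class Row(object):
    """Hypothetical row object; only getA() and getB() appear in the question."""

    def __init__(self, a, b, rest):
        self._a = a
        self._b = b
        self._rest = rest  # remaining columns (C, D, ...), kept untouched

    def getA(self):
        return self._a

    def getB(self):
        return self._b


def read_rows(lines):
    """Parse whitespace-separated lines into Row objects, keeping file order."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rows.append(Row(fields[0], fields[1], fields[2:]))
    return rows


# Stand-in for the real file: the sample table without header or row numbers.
sample = io.StringIO(u"a 0 x x\nb 1 x x\na 2 x x\nb 1 x x\na 3 x x\nc 1 x x\n")
oldlist = read_rows(sample)
print([(r.getA(), r.getB()) for r in oldlist])
```

Note that in this sketch the B values stay strings ("0", "1", ...), which is enough for the equality comparison in the algorithm.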
P.S. Some constraints, to clarify:
- I am using Python 2.7.
- I need "all interesting rows": `duplicate` has "some" interesting "a"s.
- Order is important.
- The data is in fact a record of the memory accesses of a program execution. Column A holds the memory accesses, and Column B holds some conditions that I am interested in. If a memory access occurs under several conditions at runtime, then I would like to investigate the sequence of that memory access.