2

So I have a text file like this

123
1234
123
1234
12345
123456

You can see 123 appears twice, so both instances should be removed, but 12345 appears once so it stays. My text file is about 70,000 lines.

Here is what I came up with.

file = open("test.txt",'r')
lines = file.read().splitlines() #to ignore the '\n' and turn to list structure
for appId in lines:
    if(lines.count(appId) > 1):  #if element count is not unique remove both elements
        lines.remove(appId)      #first instance removed
        lines.remove(appId)      #second instance removed


writeFile = open("duplicatesRemoved.txt",'a') #output the left over unique elements to file
for element in lines:
    writeFile.write(element + "\n")

When I run this I feel like my logic is correct, but I know for a fact the output is supposed to be around 950, yet I'm still getting 23,000 elements in my output, so a lot is not getting removed. Any ideas where the bug could reside?

Edit: I FORGOT TO MENTION. An element can only appear twice MAX.

  • Does this answer your question? [How to remove items from a list while iterating?](https://stackoverflow.com/questions/1207406/how-to-remove-items-from-a-list-while-iterating) – mkrieger1 Dec 05 '19 at 21:39
  • What if an element repeated 3 times? (or any odd number) Do you want to delete all of them or you want to just delete pairs? – Hamidreza Dec 05 '19 at 21:46
  • Use a context manager to handle files, don’t modify something while you’re iterating over it. Also, variable and function names should follow the `lower_case_with_underscores` style. – AMC Dec 05 '19 at 22:20

3 Answers

4

Use `Counter` from the built-in `collections` module:

In [1]: from collections import Counter

In [2]: a = [123, 1234, 123, 1234, 12345, 123456]

In [3]: a = Counter(a)

In [4]: a
Out[4]: Counter({123: 2, 1234: 2, 12345: 1, 123456: 1})


In [5]: a = [k for k, v in a.items() if v == 1]

In [6]: a
Out[6]: [12345, 123456]

For your particular problem, I would do it like this:

from collections import defaultdict

out = defaultdict(int)
with open('input.txt') as f:
    for line in f:
        out[line.strip()] += 1

with open('out.txt', 'w') as f:
    for k, v in out.items():
        if v == 1:  # here you use logic suitable for what you want
            f.write(k + '\n')
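As a quick sanity check on hypothetical data matching the question's sample, the counting logic keeps exactly the lines that occur once:

```python
from collections import defaultdict

lines = ["123", "1234", "123", "1234", "12345", "123456"]
counts = defaultdict(int)
for line in lines:
    counts[line] += 1  # tally occurrences of each line

# keep only the lines that appeared exactly once
unique = [k for k, v in counts.items() if v == 1]
print(unique)  # ['12345', '123456']
```

This is a single pass over the input plus a pass over the distinct values, so it stays fast even at 70,000 lines.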
Osman Mamun
2

Be careful about removing elements from a list while still iterating over that list. This changes the behavior of the list iterator, and can make it skip over elements, which may be part of your problem.
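You can see the skipping with a small hypothetical list that mirrors the question's loop. Here 1, 2, and 3 each appear twice, so the result should be empty, but it isn't:

```python
# every value appears twice, so the loop "should" remove everything
nums = [1, 2, 2, 3, 1, 3]
for n in nums:
    if nums.count(n) > 1:
        nums.remove(n)  # first instance removed
        nums.remove(n)  # second instance removed
print(nums)  # [3, 3] -- the iterator skipped past the remaining pair
```

Each pair of removals shifts the later elements left, so the iterator's position jumps over entries it never examines.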

Instead, I suggest creating a filtered copy of the list using a list comprehension: rather than removing elements that appear more than once, keep the elements that appear exactly once:

with open("test.txt") as file:
    lines = file.read().splitlines()

unique_lines = [line for line in lines if lines.count(line) == 1]  # keep lines that occur exactly once

with open("duplicatesRemoved.txt", "w") as writefile:
    writefile.write("\n".join(unique_lines) + "\n")

(Note that `splitlines()` strips the newlines, so they have to be added back when writing.) You could also easily modify the condition, for example `lines.count(line) <= 2` to keep anything with at most two occurrences.

Green Cloak Guy
    Better use a `Counter` rather than call `.count` repeatedly; this solution is unnecessarily O(n^2). – kaya3 Dec 05 '19 at 21:52
0

You can count all of the elements and store them in a dictionary:

dic = {a: lines.count(a) for a in lines}

Then remove all duplicated one from array:

for k in dic:
    if dic[k] > 1:
        while k in lines:
            lines.remove(k)

NOTE: The while loop is needed because `lines.remove(k)` removes only the first occurrence of `k` from the list, so it must be repeated until no `k` remains in the list.

If the for loop seems too complicated, you can use the dictionary another way to get rid of the duplicated values:

lines = [k for k, v in dic.items() if v==1]
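On the question's sample values (a small hypothetical check), that comprehension yields only the lines that occur once:

```python
lines = ["123", "1234", "123", "1234", "12345", "123456"]
dic = {a: lines.count(a) for a in lines}  # value -> occurrence count
result = [k for k, v in dic.items() if v == 1]
print(result)  # ['12345', '123456']
```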
Hamidreza