How to remove lines that have a duplicated value in a column from a CSV file?

Question

I have a txt file with this format:

 - 01, Spain
 - 02, USA
 - 03, India
 - 01, Italy
 - 01, Portugal
 - 04, Brasil

I need to check if the numbers are repeated. In this example, the number "01" has Spain, Italy and Portugal. If two or more lines have the same number, I need to keep only the first of the repeated number and get rid of the others. It would show this in the output file:

 - 01, Spain
 - 02, USA
 - 03, India
 - 04, Brasil

This comes up every week; with just a few minutes of searching you could have found the answer. – Netwave, Mar 28 '16 at 07:28

Amadan · Answer 1 · 2022-06-09T14:54:34.573

-1

seen = set()
with open('in.txt', 'r') as fr, open('out.txt', 'w') as fw:
    for line in fr:
        row = line.split(',')
        if row[0] not in seen:
            fw.write(line)
            seen.add(row[0])

edited Jun 09 '22 at 14:54

answered Mar 28 '16 at 07:13


What is `sets` and why would you use it if `set()` is already a built-in type? – Tomerikoo Jun 09 '22 at 10:01
@Tomerikoo Languages change. Once upon a time this was valid code, and `set` was [not a builtin](https://docs.python.org/2/library/sets.html). Though I have to admit that even when this answer was written it was already kind of obsolete. Thanks for the bump, I updated it. – Amadan Jun 09 '22 at 14:52
Sorry I didn't even check and was sure sets were always a built-in ^_^ – Tomerikoo Jun 09 '22 at 15:42

score -1 · Answer 2 · answered Mar 28 '16 at 07:13

import os

numbers = set()
# Write the de-duplicated lines to a temporary file, then swap it in.
with open("file.txt", "r") as infile, open("_file.txt", "w") as outfile:
    for line in infile:
        number = int(line.split(',')[0])
        if number not in numbers:
            numbers.add(number)
            outfile.write(line)
# os.replace overwrites the destination in a single step.
os.replace("_file.txt", "file.txt")

score -1 · Accepted Answer · answered Mar 28 '16 at 07:18

# Read your entire file into memory.
my_file = 'my_file.txt'
with open(my_file) as f_in:
    content = f_in.readlines()

# Keep track of the numbers that have already appeared
# while rewriting the content back to your file.
numbers = []
with open(my_file, 'w') as f_out:
    for line in content:
        # Split once so extra commas later in the line do not break parsing.
        number = line.split(',', 1)[0]
        if number not in numbers:
            f_out.write(line)
            numbers.append(number)

I hope this is the easiest to understand.

`numbers` would be better as a set – Tomerikoo Jun 09 '22 at 10:00
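
A minimal sketch of what that suggestion could look like, assuming the key is the first comma-separated field and that surrounding whitespace should be ignored. The names `keep_first_per_key` and `sample` are hypothetical and not taken from any of the answers. A set gives average constant-time membership checks, whereas `in` on a list scans every stored number.

```python
def keep_first_per_key(lines):
    """Return only the lines whose first comma-separated field is new."""
    seen = set()  # keys that have already been written out
    kept = []
    for line in lines:
        # Split once and strip whitespace so "01" and " 01 " count as the same key.
        key = line.split(',', 1)[0].strip()
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Sample data mirroring the question's input.
sample = [
    "01, Spain\n",
    "02, USA\n",
    "03, India\n",
    "01, Italy\n",
    "01, Portugal\n",
    "04, Brasil\n",
]
print("".join(keep_first_per_key(sample)), end="")
```

To apply it to a file, the lines read by `readlines()` could be passed in and the returned list written back with `writelines()`.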
