
How can I remove lines whose first field duplicates that of another line?

Example:

input file:

line 1 : Messi , 1 
line 2 : Messi , 2
line 3 : CR7 , 2

I want the output file to be:

line 1: CR7 , 2

Just CR7 , 2; I want to delete every line whose first field (e.g. Messi) also appears as the first field of another line. The file is not sorted.

How to do this in Python? Here is my code so far:

lines_seen = set()  # holds lines already seen
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen:  # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()
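As written, the loop above compares whole lines, so `Messi , 1` and `Messi , 2` both survive. A minimal two-pass sketch that keys on the first field instead (file names and the inline sample are illustrative, taken from the example above):

```python
from collections import Counter

# recreate the sample from the question (file names are illustrative)
with open("input.txt", "w") as f:
    f.write("Messi , 1\nMessi , 2\nCR7 , 2\n")

with open("input.txt") as f:
    lines = f.readlines()

# count how often each first field appears in the whole file
counts = Counter(line.split(",")[0].strip() for line in lines)

# keep only lines whose first field occurs exactly once
with open("output.txt", "w") as out:
    out.writelines(line for line in lines
                   if counts[line.split(",")[0].strip()] == 1)
```

On the sample, only `CR7 , 2` survives, which matches the desired output.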

A larger sample, containing the original file and its known duplicates, is linked in the comments below.

Charles Merriam
MOHA7z
  • @KousikMitra can you please edit the code with your idea – MOHA7z Nov 15 '19 at 19:45
  • Hey @MOHA7z, that really is not the proper way to do things here, asking the same question twice within 4 hours although the other one has answers?! – LeoE Nov 15 '19 at 22:30
  • Does this answer your question? [How to Remove duplicate lines from a text file and the unique related to this duplicate](https://stackoverflow.com/questions/58880450/how-to-remove-duplicate-lines-from-a-text-file-and-the-unique-related-to-this-du) – LeoE Nov 15 '19 at 22:31
  • @LeoE There is no answer working until now Bro , why i ask more if it solved ? – MOHA7z Nov 15 '19 at 22:38

3 Answers


There are a few ways.

You might want to read How do I find the duplicates in a list and create another list with them?

One answer from that, using your code:

import collections

with open(infilename, 'r') as inp:
    lines = inp.readlines()
output_lines = [line for line, count in collections.Counter(lines).items() if count > 1]
with open(outfilename, "w") as out:
    out.write(''.join(output_lines))  # the lines already end with '\n'
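For reference, `collections.Counter` tallies how many times each element occurs, which is what drives the filter above — a minimal illustration on made-up lines:

```python
from collections import Counter

lines = ["Messi , 1\n", "Messi , 1\n", "CR7 , 2\n"]
counts = Counter(lines)  # maps each distinct line to its frequency

# lines that occur more than once
duplicates = [line for line, n in counts.items() if n > 1]
# duplicates == ["Messi , 1\n"]
```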

Being provided with a sample, it's a slightly different question. Here is your solution:

import collections
from typing import List

def remove_duplicate_first_columns(lines: List[str]) -> List[str]:
    first_col = [line.split(',')[0] for line in lines]
    dups = [col for col, count in collections.Counter(first_col).items() if count > 1]
    non_dups = [line for line in lines if line.split(',')[0] not in dups]
    return non_dups


with open('input.csv') as inp:
    lines = inp.readlines()
non_dups = remove_duplicate_first_columns(lines)
with open('nondups.csv', 'w') as out:
    print(''.join(non_dups), file=out)
print(f"There were {len(lines) - len(non_dups)} lines removed.")
print("This program is gratified to be of use")

I hope this completely answers your question.
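If memory is a concern for very large files, the same idea works in two passes over the file, so only the counts (not every line) are held in memory — a sketch, with illustrative names:

```python
from collections import Counter

def unique_first_fields(infile, outfile):
    # pass 1: tally each first field
    counts = Counter()
    with open(infile) as f:
        for line in f:
            counts[line.split(",")[0]] += 1
    # pass 2: copy only lines whose first field is unique
    with open(infile) as f, open(outfile, "w") as out:
        for line in f:
            if counts[line.split(",")[0]] == 1:
                out.write(line)
```

Reading the file twice is usually cheaper than holding a 2000-line (or 2-million-line) list when only the counts matter.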

Charles Merriam
  • this code working when i use 10 lines just when i give to the code large file contain 2000 line does not delete anything do you know what is the problem ? – MOHA7z Nov 15 '19 at 21:51
  • No idea. I did make a typo: it should be `out.write(''.join(output_lines))` as the lines have a \n already. I ran it on a 18K line file and it found the 4K duplicate lines. Check that the 2000 line input uses correct line terminators? – Charles Merriam Nov 16 '19 at 01:22
  • Hold it, this is for duplicate lines. If you want unique lines, use 'count == 1'. You may have no output if your 2000 lines are all unique. – Charles Merriam Nov 16 '19 at 01:23
  • Still not working also the same does not delete the duplicate and what related too , here is 2 file >>> https://files.fm/u/sdb7ddhf ( input file and Duplicate file ) ( run your code with input file and after this check if you found any line that in Duplicate file in the input file and tell if it work ) Thanks for helping me – MOHA7z Nov 16 '19 at 01:50
  • The deletion depends on the first row if there is a match in the first row I want to delete this line – MOHA7z Nov 16 '19 at 02:27
  • Editted with new solution and cleaned up your question. As a memory trick: columns are up/down like Roman columns while rows are left/right like rowing a boat. – Charles Merriam Nov 16 '19 at 03:04
  • Thank you very much – MOHA7z Nov 16 '19 at 03:09

You need to be able to remove something that was added earlier, so you can't write directly with outfile.write(line). Instead, use an accumulator to hold the data, and only commit to writing the output once the input has been fully processed.

lines_seen = set()  # holds lines already seen
accumulator = []

with open(infilename, "r") as f:
    for line in f:
        if line not in lines_seen:  # first time we meet this line
            accumulator.append(line)
            lines_seen.add(line)
        elif line in accumulator:   # duplicate: drop the copy kept earlier
            accumulator.remove(line)

with open(outfilename, "w") as outfile:
    outfile.write(''.join(accumulator))  # the lines already end with '\n'
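Note that accumulator.remove(line) rescans the list on every duplicate, which gets slow on big inputs. A dict keyed by line (dicts preserve insertion order in Python 3.7+) does the same bookkeeping in constant time per line — a sketch on made-up data:

```python
kept = {}  # line -> keep flag; insertion order is preserved
for line in ["Messi , 1\n", "Messi , 1\n", "CR7 , 2\n"]:
    if line in kept:
        kept[line] = False  # seen before: mark every copy for removal
    else:
        kept[line] = True   # first sighting: tentatively keep

result = [line for line, keep in kept.items() if keep]
# result == ["CR7 , 2\n"]
```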
Iliyan Bobev
  • this code working when i use 10 lines just when i give to the code large file contain 2000 line does not delete anything do you know what is the problem ? – MOHA7z Nov 15 '19 at 21:25

Here is another solution you might check out.

lines_seen = set()
with open(infilename, "r") as f:
    lines = f.readlines()
# keep only the first line for each first field, preserving order
unique = [line for line in lines
          if not (line.split(",")[0] in lines_seen
                  or lines_seen.add(line.split(",")[0]))]
outfile = open(outfilename, "w")
outfile.writelines(unique)
outfile.close()

You can get some more info here: How do you remove duplicates from a list whilst preserving order?
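The standard idiom from that link, `dict.fromkeys`, keeps the first occurrence of each item in order — e.g.:

```python
items = ["Messi", "CR7", "Messi", "Ramos"]

# dict keys are unique and (since Python 3.7) keep insertion order
deduped = list(dict.fromkeys(items))
# deduped == ["Messi", "CR7", "Ramos"]
```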

Kousik
  • this code working when i use 10 lines just when i give to the code large file contain 2000 line does not delete anything do you know what is the problem ? – MOHA7z Nov 15 '19 at 21:50