How can I remove lines whose first field is duplicated on other lines of a file?
Example:
The input file contains:
line 1 : Messi , 1
line 2 : Messi , 2
line 3 : CR7 , 2
I want the output file to be:
line 1: CR7 , 2
Just CR7 , 2 remains; I want to delete every line whose first field is duplicated (e.g., Messi). The file is not sorted.
The decision depends only on the first column: if a first-column value appears on more than one line in the file, I want to delete all of those lines.
How can I do this in Python? Here is my code so far:
lines_seen = set()  # holds lines already seen
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen:  # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()
This only removes exact duplicate lines and keeps the first occurrence, so the output still contains the first Messi line instead of dropping every line whose first field repeats.
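One possible approach (a minimal sketch, assuming the fields are comma-separated as in the example and that infilename and outfilename are defined as in the snippet above) is to make two passes over the file: first count how often each first field occurs, then write only the lines whose first field appears exactly once.

from collections import Counter

# First pass: count occurrences of each first field.
with open(infilename, "r") as infile:
    counts = Counter(
        line.split(",")[0].strip()
        for line in infile
        if line.strip()  # skip blank lines
    )

# Second pass: keep only lines whose first field is unique.
with open(infilename, "r") as infile, open(outfilename, "w") as outfile:
    for line in infile:
        if line.strip() and counts[line.split(",")[0].strip()] == 1:
            outfile.write(line)

With the sample input above, only CR7 , 2 would be written, because Messi appears twice in the first column. Two passes are used because the file is not sorted, so a line cannot be kept or discarded until the whole file has been counted.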