How might I remove duplicate lines from a file?

Question

I have a file with one column. How can I delete repeated lines from it?

Vinay Sajip · Answer 1 · 2009-07-31T23:12:07.810

82

On Unix/Linux, use the uniq command, as per David Locke's answer, or sort, as per William Pursell's comment.

If you need a Python script:

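infilename = "input.txt"    # placeholder path: file to read
outfilename = "output.txt"  # placeholder path: file to write
lines_seen = set()  # holds lines already seen
with open(infilename, "r") as infile, open(outfilename, "w") as outfile:
    for line in infile:
        if line not in lines_seen:  # not a duplicate
            outfile.write(line)
            lines_seen.add(line)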

Update: The sort/uniq combination will remove duplicates but return a file with the lines sorted, which may or may not be what you want. The Python script above won't reorder lines, but just drop duplicates. Of course, to get the script above to sort as well, just leave out the outfile.write(line) and instead, immediately after the loop, do outfile.writelines(sorted(lines_seen)).
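
The sorted variant described in the update might look like the sketch below. The function name `write_sorted_unique` and its two path parameters are assumptions made for illustration; they are not part of the original script.

```python
def write_sorted_unique(infilename, outfilename):
    """Write each distinct line of infilename to outfilename in sorted order."""
    lines_seen = set()  # holds lines already seen
    with open(infilename, "r") as infile:
        for line in infile:
            lines_seen.add(line)  # adding a duplicate leaves the set unchanged
    with open(outfilename, "w") as outfile:
        # sort once, after the loop, instead of writing inside it
        outfile.writelines(sorted(lines_seen))
```

Note that a final line without a trailing newline is treated as different from the same text followed by a newline.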

edited Jul 31 '09 at 23:12

answered Jul 31 '09 at 22:46

Vinay Sajip

You need to run sort before you run uniq because uniq will only remove lines if they're identical to the previous line. – David Locke Jul 31 '09 at 22:54
Yes - I referred to your answer but didn't reiterate that it was sort followed by uniq. – Vinay Sajip Jul 31 '09 at 23:08
+1 for this solution. One further enhancement might be to store the md5 sum of the line, and compare the current line's md5 sum. This should significantly cut down on the memory requirements. (see http://docs.python.org/library/md5.html) – joeslice Jul 31 '09 at 23:18
+1 I could have written this code myself but why write when you can google :) – Adam Gent Sep 11 '12 at 15:20
This answer gives `Traceback (most recent call last): File "sort and unique.py", line 5, in outfile.write(line) MemoryError `. How can I resolve it? – Jaffer Wilson Mar 29 '17 at 12:32
@VinaySajip I understand that you need to add whatever you want to delete to `lines_seen = set()`. If I have a line `Word Syllable Phone`, how exactly should I do it? – dgr379 Mar 09 '18 at 17:10
Is this still the best answer in 2021? – MasayoMusic Jan 09 '21 at 03:42
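
joeslice's suggestion above, storing a digest of each line rather than the line itself, could be sketched as follows. This assumes Python 3, where the old `md5` module has been folded into `hashlib`; the function name `dedupe_by_digest` and its parameters are hypothetical. It only reduces memory when lines are noticeably longer than the 16-byte digest, and it may or may not be enough to avoid the MemoryError reported above.

```python
import hashlib

def dedupe_by_digest(infilename, outfilename):
    """Copy infilename to outfilename, skipping lines whose MD5 digest was already seen."""
    digests_seen = set()  # 16-byte digests instead of full lines
    with open(infilename, "rb") as infile, open(outfilename, "wb") as outfile:
        for line in infile:  # binary mode, so each line can be hashed directly
            digest = hashlib.md5(line).digest()
            if digest not in digests_seen:  # not a duplicate
                outfile.write(line)
                digests_seen.add(digest)
```

Two different lines with the same digest would be treated as duplicates; for MD5 that is extremely unlikely with ordinary text, but it is a trade-off the plain set of lines does not have.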

score 46 · Answer 2 · answered Jul 31 '09 at 22:43

If you're on *nix, try running the following command:

sort <file name> | uniq

answered Jul 31 '09 at 22:43

David Locke

Memory use blew up on a 3 GB file: it used 30 GB of my 32 GB of RAM. – svonidze May 07 '23 at 11:22
The command dumps all lines to the terminal. To save them to a file instead, use `>>`, like `sort | uniq >> ` – svonidze May 07 '23 at 11:45
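
To see why the sort has to come first, here is a small Python sketch that imitates what uniq does with no options: it only collapses runs of identical adjacent lines. The helper name `uniq` is just a label for this illustration.

```python
from itertools import groupby

def uniq(lines):
    """Collapse runs of identical adjacent lines, as uniq does with no options."""
    return [key for key, _run in groupby(lines)]

lines = ["b", "a", "a", "b"]
print(uniq(lines))          # ['b', 'a', 'b']  (the two b's are not adjacent)
print(uniq(sorted(lines)))  # ['a', 'b']
```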

score 26 · Answer 3 · edited Jan 20 '22 at 17:29

uniqlines = set(open('/tmp/foo').readlines())

This will give you the set of unique lines.

Writing that back to some file would be as easy as:

with open('/tmp/bar', 'w') as bar:
    bar.writelines(uniqlines)

edited Jan 20 '22 at 17:29

Marco Bonelli

answered Aug 01 '09 at 12:51

marcell

True, but the lines will be in some random order according to how they hash. – Vinay Sajip Aug 01 '09 at 15:42
The problem with this code is that if the last line of the input does not end with '\n', the output can contain one line that merges two lines. – wmlynarski Nov 02 '17 at 09:42
Your solution is good for smaller files, up to about 300 MB or 400 MB, but not beyond that. – Shravan Yadav Mar 18 '18 at 06:23
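
Two of the issues raised in these comments (arbitrary order and a missing final '\n') can be avoided with a small change. The sketch below assumes Python 3.7 or later, where dict keys keep insertion order; the function name `unique_lines_in_order` is hypothetical. It still reads the whole file into memory, so the size concern above applies to it as well.

```python
def unique_lines_in_order(path):
    """Return the distinct lines of path in first-seen order, each ending in '\n'."""
    with open(path) as infile:
        # strip the line ending so a final line without '\n' compares
        # equal to earlier copies of the same text
        lines = [line.rstrip("\n") for line in infile]
    # dict keys keep insertion order (guaranteed since Python 3.7)
    return [line + "\n" for line in dict.fromkeys(lines)]
```

The returned list can be passed straight to `writelines()` on the output file.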

shahjapan · Answer 4 · 2021-07-05T12:19:03.937

8

Get all your lines into a list, make a set of the lines, and you are done. For example:

>>> x = ["line1","line2","line3","line2","line1"]
>>> list(set(x))
['line3', 'line2', 'line1']
>>>

If you need to preserve the ordering of lines (a set is an unordered collection), try this:

y = []
for l in x:
    if l not in y:
        y.append(l)

and write the content back to the file.

edited Jul 05 '21 at 12:19

answered Aug 01 '09 at 15:18

shahjapan

True, but the lines will be in some random order according to how they hash. – Vinay Sajip Aug 01 '09 at 15:43

MLSC · Answer 5 · 2014-06-19T04:55:40.420

7

You can do:

import os
os.system("awk '!x[$0]++' /path/to/file > /path/to/rem-dups")

Here you are calling a shell command (awk) from Python :)
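
For readers unfamiliar with the awk idiom `!x[$0]++`: it prints a line only while that line's counter is still zero, then increments the counter. A rough Python mirror of that logic, with a hypothetical function name `first_occurrences`, could look like this:

```python
from collections import defaultdict

def first_occurrences(lines):
    """Yield each line the first time it appears, like awk '!x[$0]++'."""
    counts = defaultdict(int)  # plays the role of awk's array x
    for line in lines:
        if not counts[line]:   # !x[$0]: true while the count is still 0
            yield line
        counts[line] += 1      # ++: the increment happens after the test
```

Like the awk command, this keeps the original order of the lines.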

There is also another way:

with open('/tmp/result.txt') as result:
    uniqlines = set(result.readlines())
with open('/tmp/rmdup.txt', 'w') as rmdup:
    rmdup.writelines(uniqlines)

edited Jun 19 '14 at 04:55

answered Jun 07 '14 at 13:15

MLSC

score 6 · Answer 6 · answered Mar 31 '15 at 10:29

It's a rehash of what's already been said here. Here is what I use.

import optparse
import sys

def removeDups(inputfile, outputfile):
    # read every line, then write each distinct line once (order is not preserved)
    with open(inputfile, 'r') as infile:
        lines_set = set(infile.readlines())
    with open(outputfile, 'w') as out:
        out.writelines(lines_set)

def main():
    parser = optparse.OptionParser('usage %prog ' +
                                   '-i <inputfile> -o <outputfile>')
    parser.add_option('-i', dest='inputfile', type='string',
                      help='specify your input file')
    parser.add_option('-o', dest='outputfile', type='string',
                      help='specify your output file')
    (options, args) = parser.parse_args()
    inputfile = options.inputfile
    outputfile = options.outputfile
    if inputfile is None or outputfile is None:
        print(parser.usage)
        sys.exit(1)
    else:
        removeDups(inputfile, outputfile)

if __name__ == '__main__':
    main()

score 4 · Answer 7 · answered Sep 15 '13 at 09:16

Python one-liner:

python -c "import sys; lines = sys.stdin.readlines(); sys.stdout.write(''.join(sorted(set(lines))))" < InputFile > OutputFile

answered Sep 15 '13 at 09:16

Rahul Patil

score 4 · Answer 8 · answered Jan 27 '17 at 13:18

Adding to @David Locke's answer: on *nix systems you can run

sort -u messy_file.txt > clean_file.txt

which will create clean_file.txt containing each distinct line once, in sorted order.

answered Jan 27 '17 at 13:18

All Іѕ Vаиітy

This would remove duplicates, but modify (sort) the order of the lines. Not exactly what was asked. – jcarballo Dec 12 '18 at 20:05

score 3 · Answer 9 · answered May 10 '18 at 19:12

Look at the script I created to remove duplicate emails from text files. Hope this helps!

# function to remove duplicate emails
def remove_duplicate():
    # read emails.txt as one long string; the with block closes the file
    with open('emails.txt', 'r') as email_file:
        emails = email_file.read()
    # .split() breaks the string on whitespace and returns a list
    emails = emails.split()
    # empty list to store non-duplicate e-mails
    clean_list = []
    # append each e-mail the first time it is seen
    for email in emails:
        if email not in clean_list:
            clean_list.append(email)
    return clean_list

# write the de-duplicated e-mails to no_duplicate_emails.txt
with open('no_duplicate_emails.txt', 'w') as no_duplicate_emails:
    for email in remove_duplicate():
        # .strip(',') removes leading and trailing commas
        email = email.strip(',')
        no_duplicate_emails.write(f"E-mail: {email}\n")

Torkoal · Answer 10 · 2019-07-04T04:07:46.970

If anyone is looking for a solution that uses hashing and is a little more flashy, this is what I currently use:

import os

def remove_duplicate_lines(input_path, output_path):

    if os.path.isfile(output_path):
        raise OSError('File at {} (output file location) exists.'.format(output_path))

    with open(input_path, 'r') as input_file, open(output_path, 'w') as output_file:
        seen_lines = set()

        def add_line(line):
            seen_lines.add(line)
            return line

        output_file.writelines((add_line(line) for line in input_file
                                if line not in seen_lines))

What is the point of adding the HASH of the line to the set when you could just add the line itself? – xrisk May 13 '18 at 14:44

score 2 · Answer 11 · answered Apr 01 '20 at 22:58

To edit the file in place:

lines_seen = set() # holds lines already seen

with open("file.txt", "r+") as f:
    d = f.readlines()
    f.seek(0)
    for i in d:
        if i not in lines_seen:
            f.write(i)
            lines_seen.add(i)
    f.truncate()
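
The version above holds the whole file in memory and overwrites it directly, so an interruption midway could leave the file partly written. A commonly used alternative, sketched here under the assumption that a temporary file in the same directory is acceptable, writes the result elsewhere and then swaps it in with `os.replace`. The function name `dedupe_in_place` is hypothetical.

```python
import os
import tempfile

def dedupe_in_place(path):
    """Remove duplicate lines from path, keeping the first copy of each line."""
    lines_seen = set()  # holds lines already seen
    directory = os.path.dirname(os.path.abspath(path))
    # write to a temporary file in the same directory, then swap it in
    with open(path, "r") as src, tempfile.NamedTemporaryFile(
            "w", dir=directory, delete=False) as tmp:
        for line in src:
            if line not in lines_seen:
                tmp.write(line)
                lines_seen.add(line)
    os.replace(tmp.name, path)  # replaces the original in one step
```

One caveat: the replacement file is a new file, so the permissions and ownership of the original are not carried over by this sketch.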

Ravgeet Dhillon · Answer 12 · 2020-10-26T09:54:42.333

2

Readable and Concise

with open('sample.txt') as fl:
    content = fl.read().split('\n')

# drop empty lines and duplicates (a set does not keep the original order)
content = set(line for line in content if line != '')

content = '\n'.join(content)

with open('sample.txt', 'w') as fl:
    fl.write(content)

edited Oct 26 '20 at 09:54

answered Oct 26 '20 at 07:21

Ravgeet Dhillon

It's better to use the `with open` context manager to open the file so that it's closed safely. Also, reading the whole file in one go may not work if the file is big. – himanshu219 Oct 26 '20 at 08:18

score 1 · Answer 13 · answered Jun 28 '13 at 02:15

Here is my solution

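if __name__ == '__main__':
    # temp.txt accumulates the unique lines; 'w+' allows reading it back
    f = open('temp.txt', 'w+')
    flag = False
    with open('file.txt') as fp:
        for line in fp:
            # scan the lines written so far for a match
            for temp in f:
                if temp == line:
                    flag = True
                    print('Found Match')
                    break
            if not flag:
                f.write(line)  # the scan ended at end-of-file, so this appends
            else:
                flag = False
            f.seek(0)  # rewind for the next scan
    f.close()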

score 0 · Answer 14 · answered Jun 11 '21 at 09:00

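grep -E '^[a-zA-Z]+$' <filename> | sort -u > outfile.txt

This keeps only the lines made up entirely of letters and removes duplicate values; the output is sorted. The -E option is needed so that + means "one or more".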
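
For comparison, roughly the same filter-and-deduplicate step in Python might look like the sketch below. The names `LETTERS_ONLY` and `filter_sort_unique` are hypothetical, and Python sorts by code point, which may order lines differently from `sort` under some locales.

```python
import re

LETTERS_ONLY = re.compile(r"[a-zA-Z]+")

def filter_sort_unique(lines):
    """Keep lines made up only of ASCII letters, drop duplicates, and sort."""
    kept = {line for line in lines if LETTERS_ONLY.fullmatch(line.rstrip("\n"))}
    return sorted(kept)
```

It accepts any iterable of lines, so an open file object can be passed in directly and the result written with `writelines()`.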

answered Jun 11 '21 at 09:00

Ashwaq

Karree · Answer 15 · 2022-09-10T17:52:26.223

Here is my solution

d = input("your file:")  # enter your file name here
file1 = open(d, mode="r")
open('file2.txt', mode='w').close()  # create or empty the output file
file1row = file1.readline()

while file1row != "":
    # re-read the output so far and compare whole lines, not substrings
    with open('file2.txt', mode='r') as file2read:
        written = file2read.read().splitlines()
    if file1row.rstrip('\n') not in written:
        with open('file2.txt', mode='a') as file2:
            file2.write(file1row)
    file1row = file1.readline()
file1.close()
