Copying a file without duplicate and blank lines in Python

Question

I have written a piece of code in Python to copy an existing text file (.txt) to a new file in the same location (with a different name). This copies all of the text from the original text file, as expected:

a=open("file1.txt", "r") #existing file
b=open("file2.txt", "w") #file did not previously exist, hence "w"
for reform1 in a.readlines():
    b.write(reform1) #write the lines from 'reform1'
    reform1=a.readlines() #read the lines in the file
a.close() #close file a (file1)
b.close() #close file b (file2)

I have now been asked to amend the new file so that both duplicate lines and blank lines are removed from the copy (whilst preserving the original), leaving the remaining unique lines as they are. How can I do this?

What is meant by removing duplicate lines? Remove all lines that occur more than once? Remove only the lines that are duplicated after the first line? — Brian Schlenker, Nov 30 '16 at 16:47
You will have to keep track of all lines you have already seen and check each line against this record. A line would only be written if it is *not* in the record. — , Nov 30 '16 at 16:47
Possible duplicate of [Remove Duplicates from Text File](http://stackoverflow.com/questions/15830290/remove-duplicates-from-text-file) — Karan Nagpal, Nov 30 '16 at 16:50
You definitely don't need the `reform1=a.readlines()` line. Also: is a line considered a "duplicate" if it has *ever* been seen before, or only if it is identical to the line *immediately* above it? — jez, Nov 30 '16 at 16:58
Thank you so much for your responses! I will remove the reform1=a.readlines() line and see how that works. — motoverdi, Nov 30 '16 at 18:04
Jez - good question! I am considering a line to be 'duplicate' if it has ever been seen before. — motoverdi, Nov 30 '16 at 18:06
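
The two meanings of "duplicate" raised in the comments give different results. A minimal sketch with made-up sample lines (not taken from the asker's file), contrasting "ever seen before" with "identical to the line immediately above":

```python
from itertools import groupby

# Made-up sample lines, for illustration only.
lines = ["a\n", "a\n", "b\n", "a\n"]

# Definition 1: drop a line if it has ever been seen before.
seen = set()
ever_seen = []
for line in lines:
    if line not in seen:
        seen.add(line)
        ever_seen.append(line)

# Definition 2: drop a line only if it repeats the line immediately above it.
consecutive = [key for key, _group in groupby(lines)]
```

Under the first definition the final "a" is dropped; under the second it is kept, because a "b" separates it from the earlier copies.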

score 2 · Accepted Answer · edited May 23 '17 at 10:29

This writes to 'file2.txt' all lines in 'file1.txt' apart from those that are made up of only whitespace or that are duplicates. The order is preserved but it is assumed that with duplicates only the first instance should be written:

seen = set()
with open('file1.txt') as f, open('file2.txt','w') as o:
    for line in f:
        if not line.isspace() and line not in seen:
            o.write(line)
            seen.add(line)

Note that str.isspace() is True for any line made up only of whitespace (e.g. tabs), not just for a bare newline character. For the stricter definition, use if not line == '\n' instead (assuming there are no '\r' characters in the line endings).
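
To make the difference between the two checks concrete, here is a small sketch. The sample lines are invented for illustration:

```python
# Invented sample lines, for illustration only.
samples = ["\n", "   \n", "\t\n", "\r\n", "text\n"]

# str.isspace() treats every whitespace-only line as blank.
blank_by_isspace = [line for line in samples if line.isspace()]

# Comparing with '\n' only catches lines that are exactly one newline.
blank_by_newline = [line for line in samples if line == "\n"]
```

With these samples, isspace() flags the first four lines, while the equality check flags only the first.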

I handle opening and closing the files with the with statement, which closes both files even if an error occurs, and read the file line by line rather than loading it all at once. This is the idiomatic Python approach.

For just copying a file unchanged in Python, you should use the shutil module (for example, shutil.copyfile).
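
As a rough sketch of a plain copy with shutil: the file names and the temporary directory below are stand-ins chosen for the example, not part of the original question.

```python
import os
import shutil
import tempfile

# Stand-in files inside a temporary directory, for illustration only.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "file1.txt")
dst = os.path.join(workdir, "file2.txt")

with open(src, "w") as f:
    f.write("first line\n\nfirst line\n")

# copyfile copies the contents as they are; nothing is filtered out.
shutil.copyfile(src, dst)

with open(dst) as f:
    copied = f.read()

# Clean up the temporary directory.
shutil.rmtree(workdir)
```

The blank line and the duplicate line both survive the copy, which is why the filtering loop is still needed for the asker's second requirement.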

answered Nov 30 '16 at 16:56

Chris_Rands

An improvement might be to just check that `not line in o.readlines()` rather than having a separate list – Peter Krasinski Nov 30 '16 at 18:02
@PeterKrasinski that would be incredibly inefficient because then each check would run in O(n) time as opposed to O(1). – Navidad20 Nov 30 '16 at 18:04
@PeterKrasinski `readlines()` returns a list, so first it would have to iterate over each line in the file, and then to check if a string existed, it would have to again iterate over each item in the list to check if it matched – Navidad20 Nov 30 '16 at 18:11
@Chris_Rands thanks for your help. I got as far as the "seen.add(line)" section before I had an error - AttributeError: 'dict' object has no attribute 'add'. If I remove this seen.add line, the program does part of what I want in deleting blank lines, but not duplicate lines. I am not sure how I would fix this? Thanks – motoverdi Dec 02 '16 at 12:57
@motoverdi My mistake, try the edited code now with `seen = set()` at the top – Chris_Rands Dec 02 '16 at 13:03
@Chris_Rands thank you, that is working excellent as far as I can tell! – motoverdi Dec 02 '16 at 16:42
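
Besides the cost discussed in the comments, checking against o.readlines() has a second problem: a file opened with mode 'w' cannot be read from. A minimal sketch, using a temporary file as a stand-in for 'file2.txt':

```python
import io
import os
import tempfile

# Temporary file used as a stand-in for 'file2.txt'.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

with open(path, "w") as o:
    o.write("a line\n")
    try:
        o.readlines()  # the handle was opened for writing only
        readable = True
    except io.UnsupportedOperation:
        readable = False

os.remove(path)

# A set gives average O(1) membership tests; a list is scanned item by item.
seen_set = {"a line\n"}
seen_list = ["a line\n"]
same_answer = ("a line\n" in seen_set) == ("a line\n" in seen_list)
```

Both containers give the same answer to the membership test; the set simply answers it without scanning every stored line.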

Navidad20 · Answer 2 · 2016-11-30T16:51:58.007

1

Try this:

import re
a=open("file1.txt", "r") #existing file
b=open("file2.txt", "w") #file did not previously exist, hence "w"
exists = set()
for reform1 in a.readlines():
    if reform1 in exists:
        continue
    elif re.match(r'^\s*$', reform1): #skip lines containing only whitespace
        continue
    else:
        b.write(reform1) #write the lines from 'reform1'
        exists.add(reform1)
a.close() #close file a (file1)
b.close() #close file b (file2)

edited Nov 30 '16 at 16:51

answered Nov 30 '16 at 16:47

Navidad20

This won't remove any blank lines, since each line will at least contain '\n'. You can instead check `elif not reform1.strip()` – Brian Schlenker Nov 30 '16 at 16:50
How about that? I used regular expressions to match any whitespace lines – Navidad20 Nov 30 '16 at 16:53
As a side comment, this program might end up using too much memory depending upon the number of lines and the length of individual lines. An efficient approach would be to store a set of hash values for lines instead of lines themselves. – sid-m Nov 30 '16 at 17:00
@sid-m Using hashes is a nice idea. however, this code could first be optimised by just not using `readlines()`, which creates the whole list rather than being *lazy* – Chris_Rands Nov 30 '16 at 17:21
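
The two ideas from these comments, storing hashes and iterating lazily, can be combined roughly as follows. This is only a sketch: the function name unique_nonblank and the choice of SHA-256 are assumptions made for the example, since the comments do not name a hash function.

```python
import hashlib

def unique_nonblank(lines):
    """Yield non-blank lines whose digest has not been seen before.

    Hypothetical helper: 'lines' can be any iterable of strings,
    including an open file object, so nothing is loaded up front.
    """
    seen_digests = set()
    for line in lines:
        if line.isspace():
            continue
        # Store a fixed-size digest instead of the line itself.
        digest = hashlib.sha256(line.encode("utf-8")).digest()
        if digest not in seen_digests:
            seen_digests.add(digest)
            yield line

# Made-up sample input, for illustration only.
sample = ["a\n", "\n", "b\n", "a\n", "   \n", "b\n", "c\n"]
result = list(unique_nonblank(sample))
```

Storing digests only saves memory when lines are typically longer than the 32-byte digest, and it relies on hash collisions being negligibly unlikely. Passing an open file object instead of a list would avoid building the full list that readlines() creates.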

score 0 · Answer 3 · answered Nov 30 '16 at 17:00

Try:

a=open("file1.txt", "r") #existing file
b=open("file2.txt", "w") #file did not previously exist, hence "w"
seen = []
for reform1 in a.readlines():
    if reform1 not in seen and len(reform1) > 1:
        b.write(reform1) #write the lines from 'reform1'
        seen.append(reform1)
a.close() #close file a (file1)
b.close() #close file b (file2)

I used "len(reform1) > 1" because when I created my test file, the blank line had 1 character, presumably the "\r" or perhaps the "\n" character. Adjust this as necessary for your application.
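
How the length check behaves depends on the line endings. A small sketch with invented sample lines, comparing it with a strip()-based check as suggested in a comment on another answer:

```python
# Invented sample lines; '\r\n' can appear when newlines are not translated.
samples = ["\n", "\r\n", " \n", "text\n"]

# len(line) > 1 only rejects lines that are exactly one character long.
kept_by_len = [line for line in samples if len(line) > 1]

# line.strip() is empty (falsy) for any whitespace-only line.
kept_by_strip = [line for line in samples if line.strip()]
```

Here the length check lets "\r\n" and " \n" through, while the strip() check keeps only the line with text, so the latter may need less adjusting between files.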
