
I searched on here and found many postings, but none that I could adapt to the following code:

with open('TEST.txt') as f:
    seen = set()
    for line in f:
        line_lower = line.lower()
        # Print a line only if a (non-blank) copy of it was already seen.
        if line_lower in seen and line_lower.strip():
            print(line.strip())
        else:
            seen.add(line_lower)

With this I can find the duplicate lines inside my TEST.txt file, which contains hundreds of URLs.

However, I need to remove these duplicates and create a new text file with the duplicates removed and all other URLs intact.

I will be checking this newly created file for 404 errors using r.status_code.

In a nutshell, I basically need help getting rid of duplicates so I can check for dead links. Thanks for your help.
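For the eventual dead-link check, here is a minimal sketch assuming the requests library (implied by the r.status_code mention) and the Clean.txt file name used in the comments below:

import requests

with open('Clean.txt') as f:
    for url in (line.strip() for line in f):
        if not url:
            continue  # skip blank lines
        try:
            r = requests.get(url, timeout=10)
            if r.status_code == 404:
                print('dead link:', url)
        except requests.RequestException as exc:
            print('request failed:', url, exc)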

  • Here are a few posts that you could refer to: http://stackoverflow.com/questions/1215208/how-might-i-remove-duplicate-lines-from-a-file http://stackoverflow.com/questions/15830290/remove-duplicates-from-text-file http://stackoverflow.com/questions/19876228/how-to-delete-duplicate-lines-in-a-file-in-python – xCodeZone Sep 20 '16 at 00:42
  • In addition to `seen.add(line_lower)`, also `outfile.write(line)` in the `else` (assuming you also `open()` an `outfile` for writing before the `for` loop). – martineau Sep 20 '16 at 00:56
  • @martineau I have Test.txt & Clean.txt, where Clean.txt will have the duplicates removed. –  Sep 20 '16 at 01:36
  • Inserting each line of f into a set will take care of the duplicate problem. You will get the unique lines in the file. Then you could write them to Clean.txt, correct? – picmate 涅 Sep 20 '16 at 02:30
  • xNightmare67x: OK, change the first line to `with open('Test.txt') as f, open('Clean.txt', 'w') as outfile:`. – martineau Sep 20 '16 at 02:35
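Putting martineau's two comments together, a minimal sketch of the suggested fix (file names as given above):

with open('Test.txt') as f, open('Clean.txt', 'w') as outfile:
    seen = set()
    for line in f:
        line_lower = line.lower()
        if line_lower not in seen:
            seen.add(line_lower)
            outfile.write(line)  # write each line the first time it is seen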

3 Answers


Sounds simple enough, but what you did looks overcomplicated. I think the following should be enough:

with open('TEST.txt', 'r') as f:
    unique_lines = set(f.readlines())  # a set keeps only one copy of each line
with open('TEST_no_dups.txt', 'w') as f:
    f.writelines(unique_lines)

A couple of things to note:

  • If you are going to use a set, you might as well dump all the lines into it at creation, and f.readlines(), which returns the list of all the lines in your file, is perfect for that.
  • f.writelines() will write a sequence of lines to your file, but using a set loses the original order of the lines. So if that matters to you, I suggest replacing the last line with f.writelines(sorted(unique_lines, key=...)), using whatever key you need (an order-preserving alternative is sketched below).
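If you do want to keep the original first-seen order, one alternative not in this answer is dict.fromkeys(), whose keys preserve insertion order on Python 3.7+:

with open('TEST.txt', 'r') as f:
    unique_lines = dict.fromkeys(f)  # duplicates dropped, first-seen order kept
with open('TEST_no_dups.txt', 'w') as f:
    f.writelines(unique_lines)  # iterating a dict yields its keys in order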
ursan

This is something you could use:

import linecache

# Count the number of lines in the source file.
with open('pizza11.txt') as f:
    for i, l in enumerate(f):
        pass
    x = i + 1

with open('clean.txt', 'a') as clean:
    # The first line can never be a duplicate, so write it straight away.
    clean.write(linecache.getline('pizza11.txt', 1))
    for i in range(2, x + 1):
        a = linecache.getline('pizza11.txt', i)
        # Check line i against every line that comes before it.
        k = 0
        for j in range(1, i):
            if a == linecache.getline('pizza11.txt', j):
                k = k + 1
        if k == 0:  # no earlier copy found, so keep this line
            clean.write(a)

With this you go through every line and check it against the ones before it; if it does not match any earlier line, it is added to the output file.

pizza11.txt is a text file on my computer with a ton of entries that I use to try things like this out; you would just need to change that to whatever your starting file is. Your output file with no duplicates would be clean.txt.

  • Using `linecache` plus the complexity of using it is completely unnecessary to accomplish what the OP wants to do, in my opinion. – martineau Sep 20 '16 at 02:40

This is simpler than linecache, and it doesn't shuffle the order the way a set does:

unique_lines = []
with open('file_in.txt', 'r') as f:
    for line in f.readlines():
        if line in unique_lines: continue  # skip lines we have already kept
        unique_lines.append(line)
with open('file_out.txt', 'w') as f:
    f.writelines(unique_lines)

Old post, but I just had this question too, and this page was the first result.

  • Be aware that this does a full traversal of the unique list on every iteration. Lists are not good for containment tests. Maybe an ordered set? – Rodrigo Rodrigues Mar 24 '23 at 18:46
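A minimal sketch of what that comment suggests: pair the ordered list with a set so the membership test is O(1) (file names taken from the answer above):

unique_lines = []
seen = set()
with open('file_in.txt', 'r') as f:
    for line in f:
        if line in seen:  # O(1) membership test instead of scanning the list
            continue
        seen.add(line)
        unique_lines.append(line)
with open('file_out.txt', 'w') as f:
    f.writelines(unique_lines)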