How do I check a text file line-by-line to detect if there are duplicates?

Question

I'm trying to have my function go through the sorted text in Insults.txt, determine whether there are duplicates, and return False if there are, but I cannot seem to get it working. I am only trying to detect duplicates, not remove them! Does anybody know what I am doing wrong?

def checkInsultsFile(numInsults=1000, file="Insults.txt"):
    filename = open(file,'r').readlines()
    for i in range(0, numInsults):
        if [i] == [i+1]:
            return False
        else:
            return True

Good point, Morgan. When I run the code, it returns True even if there are duplicates in the file. — Brand Mellor, May 27 '16 at 17:04
Well, right now, as soon as it checks the first line, it returns. So it's only checking the first line. — Morgan Thrapp, May 27 '16 at 17:05
It is not even checking anything from the file here: `[i] == [i+1]` is always false, and it would return on the very first check anyway. — Vikas Madhusudana, May 27 '16 at 17:07
Thanks a lot for the feedback guys - do you know how I could navigate around this? As for the other post, I saw that but I am not trying to replace or remove duplicate lines, I am trying to detect them!! — Brand Mellor, May 27 '16 at 17:14
this is what you are looking for http://stackoverflow.com/questions/12937798/how-can-i-find-duplicate-lines-in-a-text-file-excluding-case-and-print-them — shivsn, May 27 '16 at 17:23
To expand slightly on what @VikasMadhusudana is getting at above, `[i]` and `[i+1]` are not lines from the file; they're built from integers. Your for loop makes `i` take each value in `range(0, 1000)` in turn. Putting [square brackets] around a number just turns it into a list with one item (and that item is still an integer). If you want to look at individual lines of your text, you need to use `filename[i]`, which in your code would be the `i`'th line of the file. — A_S00, May 27 '16 at 17:33
Also: `filename` is a bad variable name for this purpose, since it doesn't end up holding the file name, it holds the text of the file (due to the use of `.readlines()`). — A_S00, May 27 '16 at 17:35
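
Putting the comments above together, a minimal sketch of the asker's loop with the indexing fixed and the early `return True` moved out of the loop might look like the following. It assumes the file is sorted, as the question states, so that equal lines are adjacent. The `numInsults` parameter is dropped here and the list is renamed to `lines`; both are choices made for this sketch, not part of the original code. The return convention (False when a duplicate is found) follows the question.

```python
def checkInsultsFile(file="Insults.txt"):
    """Return False if two adjacent lines are equal, True otherwise.

    Assumes the file is sorted, so equal lines sit next to each other.
    """
    with open(file, "r") as f:
        lines = f.readlines()
    # Stop at len(lines) - 1 so that lines[i + 1] never runs off the end.
    for i in range(len(lines) - 1):
        if lines[i] == lines[i + 1]:
            return False  # found a duplicate
    return True  # only reached after every adjacent pair has been compared
```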

score 1 · Answer 1 · answered May 27 '16 at 17:17

Try this. I'm not sure why you have the numInsults parameter.

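def checkInsultsFile(numInsults=1000, file="Insults.txt"):
    # numInsults is kept from the question's signature but is not used.
    with open(file, 'r') as f:
        lines = f.readlines()

    counts = {}  # renamed from dict to avoid shadowing the built-in

    # Count how many times each line occurs.
    for line in lines:
        counts[line] = counts.get(line, 0) + 1

    # Note: returns True when a duplicate exists, False otherwise.
    for k, v in counts.items():  # .items() works on both Python 2 and 3
        if v > 1:
            return True
    return False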

st.ph.n · Answer 2 · 2016-05-31T18:53:08.977

I'm not sure why you're limiting the check with numInsults either, if you want to check the whole file and it may have more than 1,000 lines.

def checkInsultsFile(file):
    with open(file, 'r') as f:
        lines = [line.strip() for line in f]  # puts whole file into list if it's not too large for your RAM
    check = set(lines)
    if len(lines) == len(check):
        return False
    elif len(check) < len(lines):
        return True

checkInsultsFile("Insults.txt")

Alternative (run through file line by line):

def checkInsultsFile(file):
    lines = []
    with open(file, 'r') as f:
        for line in f:
            lines.append(line.strip())

    check = set(lines)
    if len(lines) == len(check):
        return False
    elif len(check) < len(lines):
        return True

checkInsultsFile("Insults.txt")

This function reads all the lines in Insults.txt into a list. `check` is a set, which keeps only the unique items from the `lines` list. If the list and the set have the same length, there are no duplicates, and the function returns False. If the set is smaller than the list, there were duplicates, and it returns True.

Alternatively, you can use bash (I don't know your OS). This is just to point out that there are faster/simpler ways to do this, unless your Python script will use the unique list of insults from the file in other ways:

sort Insults.txt | uniq -c

This is similar to what you can do with Counter from collections in Python, which will give you a count of how often each distinct line occurs in the file.
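
As a rough illustration of the Counter idea, here is a minimal sketch. The function name `duplicate_counts` is made up for this example, and it takes any iterable of lines (an open file works) rather than a file name. It strips the trailing newline so that a final line without one still compares equal to earlier copies.

```python
from collections import Counter

def duplicate_counts(lines):
    """Map each line that occurs more than once to its number of occurrences."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return {line: n for line, n in counts.items() if n > 1}

# Possible usage with the file from the question:
# with open("Insults.txt") as f:
#     print(duplicate_counts(f))
```

An empty result means there are no duplicates, so `bool(duplicate_counts(f))` gives the same yes/no answer as the functions above.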

This answer has some potential, I think, but isn't quite there imo. Why are you noting what can be done with bash in a Python question? Any links about that Counter stuff? Is it really necessary to read the whole file into memory when it's iterable? The file is sorted... does that help? Why do you need that last elif; is there any chance the set is *larger* than the list? Maybe you don't need to address all of those pedantic things (or even most), but a few more details would be nice — en_Knight, May 27 '16 at 19:09

score 1 · Answer 3 · answered May 28 '16 at 03:18

Mine's a lazier approach, as its execution will stop as soon as it finds a duplicate.

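def checkInsultsFile(filename):
    try:
        with open(filename, 'r') as file:
            s = set()  # lines seen so far
            for line in file:
                if line in s:
                    return True  # duplicate found; stop reading
                s.add(line)
            return False
    except IOError:
        # Placeholder: replace with your own error handling.
        handleExceptionFromFileError()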

score 0 · Answer 4 · answered May 28 '16 at 03:35

What is happening

if [i] == [i+1]:
    return False
else:
    return True

Initially, i is 0. Is a one-element list that contains 0 equal to a one-element list that contains 1? Clearly not. So execution goes to the else clause, and the function returns True.

It doesn't even care about the length or the contents of the file, as long as it exists and is readable.

A working solution

Take a cue from the itertools recipe for pairwise(iterable), which produces pairs (line1, line2), (line2, line3), (line3, line4), etc.

Also, use the any() function to simplify the inner loop.

from itertools import tee

def any_consecutive_duplicate_lines(file='Insults.txt'):
    """Return True if the file contains any two consecutive equal lines."""
    with open(file) as f:
        a, b = tee(f)
        next(b, None)
        return any(a_line == b_line for a_line, b_line in zip(a, b))
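
Note that comparing neighbouring lines only finds duplicates because the file is sorted, as the question says. A small sketch of the difference, using in-memory lists instead of a file (the helper name `has_adjacent_duplicates` is made up for this example):

```python
from itertools import tee

def has_adjacent_duplicates(lines):
    """Return True if any two neighbouring items in `lines` are equal."""
    a, b = tee(lines)
    next(b, None)  # advance the second iterator by one item
    return any(x == y for x, y in zip(a, b))

unsorted_lines = ["b\n", "a\n", "b\n"]
print(has_adjacent_duplicates(unsorted_lines))          # False: the two "b" lines are not adjacent
print(has_adjacent_duplicates(sorted(unsorted_lines)))  # True: sorting puts them side by side
```

If the input might not be sorted, a set-based check is the safer choice.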

EoinS · Answer 5 · 2016-05-28T03:57:42.067

If you just need to return whether there are any dupes, we can take your function and simplify it a little:

def checkdup(file="Insults.txt"):
    with open(file, 'r') as f:
        lines = f.readlines()
    return len(lines) != len(set(lines))

Basically we do two things: read all the lines in the text file into a list, then check whether the number of items in that list

len(lines)  # the number of insults in your file

is the same as the number of items in a collection of the unique elements of that list

len(set(lines))  # the number of unique elements of our list, or unique insults

If they're not the same, there must be dupes!
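
One caveat with comparing raw lines: `readlines()` keeps the trailing newline, so if the last line of the file has no newline, it will not compare equal to an earlier copy of the same text. A minimal sketch of a variant that strips the newline first (the name `has_duplicates` is made up for this example, and it takes a list of lines rather than a file name):

```python
def has_duplicates(lines):
    """Return True if any line occurs more than once, ignoring trailing newlines."""
    stripped = [line.rstrip("\n") for line in lines]
    return len(stripped) != len(set(stripped))

# The last "a" has no trailing newline, as can happen at the end of a file.
raw = ["a\n", "b\n", "a"]
print(len(raw) != len(set(raw)))  # False: "a\n" and "a" are different strings
print(has_duplicates(raw))        # True: equal once the newline is stripped
```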
