0

I wrote the following code to help me grab duplicate lines in a file and list out the line number of each duplicated line.

this code works when not in a function. But when I put the code inside a function as is shown below, it's not behaving like I'm expecting it to.

I want the values of the "getallDups" function to be stored in variable data.

#!/usr/bin/env python

filename = '/tmp/test.txt'
f = open(filename, "r")
contentAslist = f.read().splitlines()
def getallDups():
    lc = 0
    mystring = ""
    for eitem in contentAslist:
        lc += 1
        if contentAslist.count(eitem) > 1:
            mystring = lc,eitem
            return(mystring)

data = getallDups()
print data

The above code only stores the first duplicated line. it doesn't list all the duplicated lines.

How can this code be modified to do precisely what I want? How can it be modified to store the value of the defined function in the variable "data", which I can then play with.

swenson
  • 67
  • 6

3 Answers3

1

Your trouble here is that you're returning within a loop, which means that you never get the remainder of your data. You could fix that by simply swapping return for yield and changing your retrieval call to:

data = list(getallDups())

This will allow your loop to complete fully.

g.d.d.c
  • 46,865
  • 9
  • 101
  • 111
  • +1 for introducing `yield`, which is more pythonic than my solution. For more info about `yeld` and Python `generator`s, see https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do . However, it's still an O(N^2) solution, which is not a great idea. – Jeff Learman Jun 11 '18 at 23:57
1

You put a return statement in a loop inside a function: the return causes the function end at its first iteration... Possible ways are to return a list (and gather strings in the loop) or change the function to a generator.

Returning a list:

filename = '/tmp/test.txt'
f = open(filename, "r")
contentAslist = f.read().splitlines()
def getallDups():
    mylist = []
    lc = 0
    for eitem in contentAslist:
        lc += 1
        if contentAslist.count(eitem) > 1:
            mylist.append((lc, eitem))      # append the duplicated line to a list
    return mylist                           # return the fully populated list

data = getallDups()
print data

Generator version:

filename = '/tmp/test.txt'
f = open(filename, "r")
contentAslist = f.read().splitlines()
def getallDups():
    mylist = []
    lc = 0
    for eitem in contentAslist:
        lc += 1
        if contentAslist.count(eitem) > 1:
            yield (lc, eitem)    # yield duplicate lines one at a time

data = list(getallDups())        # build a list from the generator values
print data
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
  • +1 for introducing yield, which is more pythonic than my solution. For more info about yeld and Python generators, see stackoverflow.com/questions/231767/… . However, it's still an O(N^2) solution, which is not a great idea. – Jeff Learman Jun 12 '18 at 00:14
1

If you want it to return more results, it needs to calculate more results. Instead of returning the first match it finds, you need it to add that result to a list, and return the list:

contentAslist = [
    "abcd",
    "efgh",
    "abcd",
    "ijk",
    "lmno",
    "ijk",
    "lmno",
    "ijk",
]

def getallDups():
    lc = 0
    result = []
    for eitem in contentAslist:
        lc += 1
        if contentAslist.count(eitem) > 1:
            result.append((lc, eitem))
    return result

data = getallDups()
print data

However, this is a very inefficient method, O(N^2), because list.count() method is O(N) for N items in the list, and we call it N times.

A better way is to use a hash. Note that the return type here is very different, but might be more useful, and can easily be converted to your original form.

import collections
contentAslist = [
    "abcd",
    "efgh",
    "abcd",
    "ijk",
    "lmno",
    "ijk",
    "lmno",
    "ijk",
]
def getallDups():
    lc = 1
    # OrderedDict is same as "{}" except that when we iterate them later they're in the order that we added them.
    lhash = collections.OrderedDict()
    for line in contentAslist:
        # get list of line numbers matching this line, or empty list if it's the first
        line_numbers = lhash.get(line, [])
        # add this line number to the list
        line_numbers.append(lc)
        # Store the list of line numbers matching this line in the hash
        lhash[line] = line_numbers
        lc += 1

    return lhash

data = getallDups()

for line, line_numbers in data.iteritems():
    if len(line_numbers) > 1:
        print line, ":",
        for ln in line_numbers:
            print ln,
        print

The above solution is O(N).

Sample input:

abcd
efgh
abcd
ijk
lmno
ijk
lmno
ijk

Output:

abcd : 1 3
ijk : 4 6 8
lmno : 5 7
Jeff Learman
  • 2,914
  • 1
  • 22
  • 31
  • 1
    this makes the output cleaner by outputting the duplicated content once and listing the duplicated line numbers next to the content. brilliant! – swenson Jun 11 '18 at 22:55
  • Please click the up-arrow (triangle) by the answer if you find it helpful. – Jeff Learman Jun 12 '18 at 00:13
  • 1
    i dont have enough reputation points to do that. but when i do, i will upvote this because it is very helpful. thanks again! – swenson Jun 12 '18 at 02:37
  • 1
    BTW, there's a better way to iterate though a sequence and also have the index. Above I use `lc` and increment it. A far better way is to use Python's built-in `enumerate()`: `for (index, value) in enumerate(sequence): ...` By default it starts with zero but you can pass the starting index. – Jeff Learman Jun 25 '18 at 16:41