python unable to store function results in variable

Question

I wrote the following code to help me grab duplicate lines in a file and list out the line number of each duplicated line.

this code works when not in a function. But when I put the code inside a function as is shown below, it's not behaving like I'm expecting it to.

I want the values of the "getallDups" function to be stored in variable data.

#!/usr/bin/env python

filename = '/tmp/test.txt'
f = open(filename, "r")
contentAslist = f.read().splitlines()
def getallDups():
    lc = 0
    mystring = ""
    for eitem in contentAslist:
        lc += 1
        if contentAslist.count(eitem) > 1:
            mystring = lc,eitem
            return(mystring)

data = getallDups()
print data

The above code only stores the first duplicated line. it doesn't list all the duplicated lines.

How can this code be modified to do precisely what I want? How can it be modified to store the value of the defined function in the variable "data", which I can then play with.

"This code that contains a `return` statement works when it's not in a function" <- Yeah, I doubt that. — Aran-Fey, Jun 11 '18 at 21:08
@Aran-Fey I assume that if OP replaces the return with a print, then all duplicate lines are printed... — Serge Ballesta, Jun 11 '18 at 21:27
BTW, to make your code a **complete** minimal verifiable example, include the input as part of the program, as my solution does below. — Jeff Learman, Jun 12 '18 at 00:09

score 1 · Answer 1 · answered Jun 11 '18 at 21:24

1

Your trouble here is that you're returning within a loop, which means that you never get the remainder of your data. You could fix that by simply swapping return for yield and changing your retrieval call to:

data = list(getallDups())

This will allow your loop to complete fully.

answered Jun 11 '18 at 21:24

g.d.d.c

46,865
9
101
111

+1 for introducing `yield`, which is more pythonic than my solution. For more info about `yeld` and Python `generator`s, see https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do . However, it's still an O(N^2) solution, which is not a great idea. – Jeff Learman Jun 11 '18 at 23:57

score 1 · Accepted Answer · answered Jun 11 '18 at 21:25

You put a return statement in a loop inside a function: the return causes the function end at its first iteration... Possible ways are to return a list (and gather strings in the loop) or change the function to a generator.

Returning a list:

filename = '/tmp/test.txt'
f = open(filename, "r")
contentAslist = f.read().splitlines()
def getallDups():
    mylist = []
    lc = 0
    for eitem in contentAslist:
        lc += 1
        if contentAslist.count(eitem) > 1:
            mylist.append((lc, eitem))      # append the duplicated line to a list
    return mylist                           # return the fully populated list

data = getallDups()
print data

Generator version:

filename = '/tmp/test.txt'
f = open(filename, "r")
contentAslist = f.read().splitlines()
def getallDups():
    mylist = []
    lc = 0
    for eitem in contentAslist:
        lc += 1
        if contentAslist.count(eitem) > 1:
            yield (lc, eitem)    # yield duplicate lines one at a time

data = list(getallDups())        # build a list from the generator values
print data

+1 for introducing yield, which is more pythonic than my solution. For more info about yeld and Python generators, see stackoverflow.com/questions/231767/… . However, it's still an O(N^2) solution, which is not a great idea. — Jeff Learman, Jun 12 '18 at 00:14

Jeff Learman · Answer 3 · 2018-06-12T00:07:49.403

If you want it to return more results, it needs to calculate more results. Instead of returning the first match it finds, you need it to add that result to a list, and return the list:

contentAslist = [
    "abcd",
    "efgh",
    "abcd",
    "ijk",
    "lmno",
    "ijk",
    "lmno",
    "ijk",
]

def getallDups():
    lc = 0
    result = []
    for eitem in contentAslist:
        lc += 1
        if contentAslist.count(eitem) > 1:
            result.append((lc, eitem))
    return result

data = getallDups()
print data

However, this is a very inefficient method, O(N^2), because list.count() method is O(N) for N items in the list, and we call it N times.

A better way is to use a hash. Note that the return type here is very different, but might be more useful, and can easily be converted to your original form.

import collections
contentAslist = [
    "abcd",
    "efgh",
    "abcd",
    "ijk",
    "lmno",
    "ijk",
    "lmno",
    "ijk",
]
def getallDups():
    lc = 1
    # OrderedDict is same as "{}" except that when we iterate them later they're in the order that we added them.
    lhash = collections.OrderedDict()
    for line in contentAslist:
        # get list of line numbers matching this line, or empty list if it's the first
        line_numbers = lhash.get(line, [])
        # add this line number to the list
        line_numbers.append(lc)
        # Store the list of line numbers matching this line in the hash
        lhash[line] = line_numbers
        lc += 1

    return lhash

data = getallDups()

for line, line_numbers in data.iteritems():
    if len(line_numbers) > 1:
        print line, ":",
        for ln in line_numbers:
            print ln,
        print

The above solution is O(N).

Sample input:

abcd
efgh
abcd
ijk
lmno
ijk
lmno
ijk

Output:

abcd : 1 3
ijk : 4 6 8
lmno : 5 7

this makes the output cleaner by outputting the duplicated content once and listing the duplicated line numbers next to the content. brilliant! — swenson, Jun 11 '18 at 22:55
Please click the up-arrow (triangle) by the answer if you find it helpful. — Jeff Learman, Jun 12 '18 at 00:13
i dont have enough reputation points to do that. but when i do, i will upvote this because it is very helpful. thanks again! — swenson, Jun 12 '18 at 02:37
BTW, there's a better way to iterate though a sequence and also have the index. Above I use `lc` and increment it. A far better way is to use Python's built-in `enumerate()`: `for (index, value) in enumerate(sequence): ...` By default it starts with zero but you can pass the starting index. — Jeff Learman, Jun 25 '18 at 16:41

python unable to store function results in variable

3 Answers3