56

I'm trying to learn python. Here is the relevant part of the exercise:

For each word, check to see if the word is already in a list. If the word is not in the list, add it to the list.

Here is what I've got.

fhand = open('romeo.txt')
output = []

for line in fhand:
    words = line.split()
    for word in words:
        if word is not output:
            output.append(word)

print sorted(output)

Here is what I get.

['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and',
 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is',
 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun',
 'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']

Note duplication (and, is, sun, etc).

How do I get only unique values?

Mateen Ulhaq
  • 24,552
  • 19
  • 101
  • 135
Tim Elhajj
  • 835
  • 1
  • 9
  • 16
  • 5
    The idiomatic way is to maintain a *set* of words to check against. All those linear scans over a growing list makes an otherwise linear algorithm degrade to quadratic. – John Coleman Feb 19 '17 at 23:31

5 Answers5

97

To eliminate duplicates from a list, you can maintain an auxiliary list and check against.

myList = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and', 
     'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill', 'light', 
     'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the', 
     'through', 'what', 'window', 'with', 'yonder']

auxiliaryList = []
for word in myList:
    if word not in auxiliaryList:
        auxiliaryList.append(word)

output:

['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 
  'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick',
  'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']

This is very simple to comprehend and code is self explanatory. However, code simplicity comes on the expense of code efficiency as linear scans over a growing list makes a linear algorithm degrade to quadratic.


If the order is not important, you could use set()

A set object is an unordered collection of distinct hashable objects.

Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.

Since the average case for membership checking in a hash-table is O(1), using a set is more efficient.

auxiliaryList = list(set(myList))

output:

['and', 'envious', 'already', 'fair', 'is', 'through', 'pale', 'yonder', 
 'what', 'sun', 'Who', 'But', 'moon', 'window', 'sick', 'east', 'breaks', 
 'grief', 'with', 'light', 'It', 'Arise', 'kill', 'the', 'soft', 'Juliet']
Tony Tannous
  • 14,154
  • 10
  • 50
  • 86
  • hey I had the same problem , but here consider a list that grows exponentially almost daily , how do you handle that , I have done the same using a while and pop , but its really slow – dev Jan 12 '21 at 14:48
  • @ETL_Devs better ask a new question with context to where are you getting the data from? are you writing it to disk at end of day? etc... – Tony Tannous Jan 12 '21 at 15:35
  • With existing list `l` and new list `n` to add only uniques from I like to write it with the `extend` method like this, since it reads easy without a loop: `l.extend([i for i in n if i not in l])` – kontur Jun 09 '21 at 14:22
27

Instead of is not operator, you should use not in operator to check whether the item is in the list:

if word not in output:

BTW, using set is a lot efficient (See Time complexity):

with open('romeo.txt') as fhand:
    output = set()
    for line in fhand:
        words = line.split()
        output.update(words)

UPDATE The set does not preserve the original order. To preserve the order, use the set as an auxiliary data structure:

output = []
seen = set()
with open('romeo.txt') as fhand:
    for line in fhand:
        words = line.split()
        for word in words:
            if word not in seen:  # faster than `word not in output`
                seen.add(word)
                output.append(word)
falsetru
  • 357,413
  • 63
  • 732
  • 636
  • 1
    The exercise calls for a list where the words will be ordered according to the order of their first appearance, so I don't see how `set()` could *replace* the list, though it would clearly be a helpful auxiliary data structure. – John Coleman Feb 19 '17 at 23:36
  • @JohnColeman, Thanks for the comment. I thought it doesn't matter, because OP used `sorted` at the end of the code. I will update the answer to include the version that preserve the order. – falsetru Feb 19 '17 at 23:38
  • I see what you mean. I took that to be their way of checking for duplicates. The problem description itself didn't specify sorting the result. – John Coleman Feb 19 '17 at 23:41
  • would it not be more efficient to build the set and convert it to a list at the end, outside of the loop? rather than appending repeatedly? – PatrickT Jun 14 '20 at 06:28
  • 1
    @PatrickT, The first code does as you mentioned. As mentioned in the answer `The set does not preserve the original order.`; if you want the result result to keep order of appearance in the field, you should do it like the second code. – falsetru Jun 14 '20 at 07:40
8

One method is to see if it's in the list prior to adding, which is what Tony's answer does. If you want to delete duplicate values after the list has been created, you can use set() to convert the existing list into a set of unique values, and then use list() to convert it into a list again. All in just one line:

list(set(output))

If you want to sort alphabetically, just add a sorted() to the above. Here's the result:

['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 
 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 
 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
Barbaros Özhan
  • 59,113
  • 10
  • 31
  • 55
Advait Saravade
  • 3,029
  • 29
  • 34
  • 2
    If order doesn't matter then IMO this is the best answer so far as it will perform linearly and is also more concise. It does produces extra objects but that i okay for a lab/one-off program. This list -> set -> list pattern is used quite commonly – rsjethani Jun 27 '18 at 08:03
6

Here's a "one-liner" which uses this implementation of removing duplicates while preserving order:

def unique(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

output = unique([word for line in fhand for word in line.split()])

The last line flattens fhand into a list of words, and then calls unique() on the resulting list.

Community
  • 1
  • 1
Mateen Ulhaq
  • 24,552
  • 19
  • 101
  • 135
0

fh = open('romeo.txt')
content = fh.read()
words = content.split()

mylist = list()
for word in words:
    if word not in mylist:
        mylist.append(word)

mylist.sort()
print(mylist)

fh.close()

7ud02
  • 1
  • 1
  • Hi @7ud02, welcome on SO. Here are two hints, first is it easier for users to understand your answer when you and some prosa to your code. Seconde, the question was about to avoid duplicates in the list. And I think you answer doesn't solve this problem. Please think about updating your answer. – mosc9575 Jan 12 '21 at 15:58
  • First you read the content of the file, then create a list by using split(). Afterwards it adds only unique words to the newly created list(). In the end you sort and print the list. So there are NO duplicates at all. – 7ud02 Jan 12 '21 at 17:14
  • Your approche looks similar to the one in the question, so I was confused. Futhermore I want to tell you to never use keywords of the python language as a variable. This can cause some very difficult errors. – mosc9575 Jan 12 '21 at 18:16
  • 1
    Hello, Welcome to Stack Overflow! While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. – ppwater Jan 13 '21 at 01:08