Python Duplicate words

Question

I have a question where I have to count the duplicate words in a sentence in Python (v3.4.1) and report each word with its count. I used counter but I don't know how to get the output in the following order. The input is:

mysentence = As far as the laws of mathematics refer to reality they are not certain as far as they are certain they do not refer to reality

I made this into a list and sorted it.

The output is supposed to be this:

"As" is repeated 1 time.
"are" is repeated 2 times.
"as" is repeated 3 times.
"certain" is repeated 2 times.
"do" is repeated 1 time.
"far" is repeated 2 times.
"laws" is repeated 1 time.
"mathematics" is repeated 1 time.
"not" is repeated 2 times.
"of" is repeated 1 time.
"reality" is repeated 2 times.
"refer" is repeated 2 times.
"the" is repeated 1 time.
"they" is repeated 3 times.
"to" is repeated 2 times.

This is what I have so far:

x=input ('Enter your sentence :')
y=x.split()
y.sort()
for y in sorted(y):
    print (y)

You should take a look at the collections.Counter class. It is very relevant to your use case. — Chris Arena, Sep 11 '14 at 23:56
@ChrisArena: He says right in the first sentence "I used counter"… — abarnert, Sep 11 '14 at 23:57
Why are all of your variables called `y`? Are you trying to make your code confusing, or are most of your other keys broken? — abarnert, Sep 11 '14 at 23:58
@abarnert I wasn't sure he meant that in the official 'Counter' sense, more that he was trying to count them. And clearly it's not in his code. — Chris Arena, Sep 12 '14 at 00:08
@ChrisArena: You may be right. It's hard to tell from a vague question like this. — abarnert, Sep 12 '14 at 00:17

sberry · Accepted Answer · 2014-09-12T00:34:11.887

14

I can see where you are going with sort, as you can reliably know when you have hit a new word and keep track of counts for each unique word. However, what you really want to do is use a hash (dictionary) to keep track of the counts as dictionary keys are unique. For example:

words = sentence.split()
counts = {}
for word in words:
    if word not in counts:
        counts[word] = 0
    counts[word] += 1

Now that will give you a dictionary where the key is the word and the value is the number of times it appears. There are things you can do like using collections.defaultdict(int) so you can just add the value:

import collections

counts = collections.defaultdict(int)  # missing keys start at 0
for word in words:
    counts[word] += 1

But there is even something better than that... collections.Counter which will take your list of words and turn it into a dictionary (an extension of dictionary actually) containing the counts.

counts = collections.Counter(words)

From there you want the words in sorted order with their counts so you can print them. items() gives you the (word, count) pairs as tuples, and sorted() returns them as a list ordered (by default) by the first item of each tuple, which is the word in this case and exactly what you want.

import collections
sentence = """As far as the laws of mathematics refer to reality they are not certain as far as they are certain they do not refer to reality"""
words = sentence.split()
word_counts = collections.Counter(words)
for word, count in sorted(word_counts.items()):
    print('"%s" is repeated %d time%s.' % (word, count, "s" if count > 1 else ""))

OUTPUT

"As" is repeated 1 time.
"are" is repeated 2 times.
"as" is repeated 3 times.
"certain" is repeated 2 times.
"do" is repeated 1 time.
"far" is repeated 2 times.
"laws" is repeated 1 time.
"mathematics" is repeated 1 time.
"not" is repeated 2 times.
"of" is repeated 1 time.
"reality" is repeated 2 times.
"refer" is repeated 2 times.
"the" is repeated 1 time.
"they" is repeated 3 times.
"to" is repeated 2 times.
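
On the print line: `%s` and `%d` are printf-style placeholders that are filled, in order, from the tuple after the `%` operator, and the conditional expression supplies the plural suffix. A minimal sketch of that line taken apart, with a `str.format` version for comparison; the helper names `describe` and `describe_format` are made up for illustration and are not part of the answer above.

```python
def describe(word, count):
    # Hypothetical helper, only here to show the pieces of the print line.
    suffix = "s" if count > 1 else ""   # "s" only when count > 1
    # %s inserts the word as a string, %d inserts the count as an integer.
    return '"%s" is repeated %d time%s.' % (word, count, suffix)


def describe_format(word, count):
    # Same text built with str.format and named fields.
    return '"{w}" is repeated {n} time{s}.'.format(
        w=word, n=count, s="s" if count > 1 else "")


print(describe("As", 1))          # "As" is repeated 1 time.
print(describe_format("as", 3))   # "as" is repeated 3 times.
```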

edited Sep 12 '14 at 00:34

answered Sep 12 '14 at 00:12


I think it is better to use .split() than .split(" ") because the latter will add '' and '\n' to the list and we don't want to consider them 'words'. .split() eats white spaces and new lines. – Piotr Dabkowski Sep 12 '14 at 00:20
Great explanation! One minor quibble: When (pre-)explaining the equivalent of what `Counter` does, it's probably better to use `word not in counts` instead of `not counts.get(word)`. Besides being more idiomatic, and more correct in other (non-`Counter`) cases where falsey values can be valid, it makes it clearer that you're checking that this is a new word that's not been seen before. – abarnert Sep 12 '14 at 00:21
@PiotrDabkowski: Good point. But the OP's posted input doesn't have any newlines. And if his real input does, I'm willing to bet it also has punctuation, which means we need something more than `str.split` anyway (whether `re.findall`, `re.split`, `str.split` plus `str.translate`, …). – abarnert Sep 12 '14 at 00:23
Thanks very much sberry I understand completely :) thanks for including the explanations as well. I am sorry but I am new to programming :) – Erwy Lionel Sep 12 '14 at 00:30
@PiotrDabkowski: Fair point on `split()`, but like @abarnert says, as soon as we introduce punctuation `str.split()` is out anyway. – sberry Sep 12 '14 at 00:34
Can you explain the print line to me @sberry? – Erwy Lionel Sep 12 '14 at 00:38
@ErwyLionel: See [`printf`-Style Formatting](https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting) for details, and also look at the more flexible and sometimes-simpler (but also sometimes less simple, which is why we still have both…) [`format` function and `str.format` method](https://docs.python.org/3/library/string.html#format-string-syntax). – abarnert Sep 12 '14 at 01:18
+1 for `sorted(Counter(words))`. I've provided [`groupby(sorted(words))` solution](http://stackoverflow.com/a/25799168/4279). It doesn't matter for the OP but here's a [performance comparison Counter vs. defaultdict vs. groupby](http://stackoverflow.com/a/13656047/4279). – jfs Sep 12 '14 at 01:18
@J.F.Sebastian: The performance metrics aren't too surprising; anything that requires sorting has an O(NlogN) step with a reasonably high constant; anything that only hashes is linear and (unless the values are very expensive to hash) cheap, and that difference should dominate everything else. At least until you have so much data that allocation/swapping costs start to dominate. – abarnert Sep 12 '14 at 01:31
@abarnert: `Counter` is slower than `defaultdict` that is slower than `groupby()` if the input is sorted -- it is not obvious. – jfs Sep 12 '14 at 02:28
@J.F.Sebastian: If the input is already sorted, `groupby` is faster; if not, it's slower, because of the need to sort—e.g., for the `count_words` test it's 1.56s vs. 501ms. Exactly as you'd expect. And Counter being slower than defaultdict is also not surprising because the latter is implemented in C (although as of either 3.3 or 3.4 the slowest part of `Counter` is now C-accelerated too, so the results might be closer). – abarnert Sep 12 '14 at 04:10
@abarnert: hindsight is 20/20. Experience shows that you should measure the time performance even if you think that you know the result already. – jfs Sep 13 '14 at 13:48
@J.F.Sebastian: Of course. But you should also know how to predict performance, and how to understand it once measured, so you know which variations are worth testing, and even more importantly which measurements are surprising and worth looking into further to learn more. – abarnert Sep 13 '14 at 20:00
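
To make the earlier comments about `split()` and punctuation concrete, here is a small sketch. The sample text and the regular expression are assumptions chosen for illustration: a "word" is taken to be a run of letters, digits, underscores or apostrophes, which may not match what the real input needs.

```python
import re
from collections import Counter

# Made-up sample with a newline, a double space and punctuation.
text = "As far as the laws go,\nthey are  not certain. As far as they go."

# split(" ") splits on single spaces only: the double space yields ''
# and the newline stays inside a token.
print(text.split(" "))
# split() splits on any run of whitespace, but punctuation stays attached.
print(text.split())

# Assumed word definition: letters, digits, underscores or apostrophes.
words = re.findall(r"[\w']+", text)
counts = Counter(words)
for word, count in sorted(counts.items()):
    print('"%s" is repeated %d time%s.' % (word, count, "s" if count > 1 else ""))
```

Lower-casing the text first (`text.lower()`) would merge "As" and "as", but the expected output in the question keeps them separate.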

score 2 · Answer 2 · answered Sep 12 '14 at 00:59

To print word duplicates from a string in the sorted order:

from itertools import groupby 

mysentence = ("As far as the laws of mathematics refer to reality "
              "they are not certain as far as they are certain "
              "they do not refer to reality")
words = mysentence.split() # get a list of whitespace-separated words
for word, duplicates in groupby(sorted(words)): # sort and group duplicates
    count = len(list(duplicates)) # count how many times the word occurs
    print('"{word}" is repeated {count} time{s}'.format(
            word=word, count=count,  s='s'*(count > 1)))

Output

"As" is repeated 1 time
"are" is repeated 2 times
"as" is repeated 3 times
"certain" is repeated 2 times
"do" is repeated 1 time
"far" is repeated 2 times
"laws" is repeated 1 time
"mathematics" is repeated 1 time
"not" is repeated 2 times
"of" is repeated 1 time
"reality" is repeated 2 times
"refer" is repeated 2 times
"the" is repeated 1 time
"they" is repeated 3 times
"to" is repeated 2 times

Probably overkill… but any chance for someone to learn `groupby` is a good thing. :) — abarnert, Sep 12 '14 at 01:14
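
For a rough feel of how the `groupby(sorted(words))` approach compares with `Counter` on unsorted input, a timing sketch along these lines can be run locally. The function names, the repeat factor of 1000 and the run count of 20 are arbitrary choices for illustration; the numbers depend on the machine and Python version, so none are quoted here.

```python
import timeit
from collections import Counter
from itertools import groupby

sentence = ("As far as the laws of mathematics refer to reality "
            "they are not certain as far as they are certain "
            "they do not refer to reality")
words = sentence.split() * 1000   # repeat the sample to get a measurable input


def count_with_counter(words):
    # Hash-based counting: one pass, no sorting needed.
    return dict(Counter(words))


def count_with_groupby(words):
    # groupby only merges adjacent equal items, so the list is sorted first.
    return {word: len(list(group)) for word, group in groupby(sorted(words))}


# Both approaches must agree before their timings are worth comparing.
assert count_with_counter(words) == count_with_groupby(words)

for func in (count_with_counter, count_with_groupby):
    seconds = timeit.timeit(lambda: func(words), number=20)
    print("%s: %.4f s for 20 runs" % (func.__name__, seconds))
```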

HimanshuGahlot · Answer 3 · 2017-12-08T19:30:26.743

I tried this on Python 2.7 (Mac), since that is the version I have, so focus on the logic:

from collections import Counter

mysentence = """As far as the laws of mathematics refer to reality they are not certain as far as they are certain they do not refer to reality"""

counts = Counter(mysentence.split())
for word in sorted(counts):
    # add the plural "s" only when the word occurs more than once
    plural = 's' if counts[word] > 1 else ''
    print ('"'+word+'" is repeated '+str(counts[word])+' time'+plural+'.')

I hope this is what you are looking for. If not, ping me; I'm happy to learn something new.

"As" is repeated 1 time.
"are" is repeated 2 times.
"as" is repeated 3 times.
"certain" is repeated 2 times.
"do" is repeated 1 time.
"far" is repeated 2 times.
"laws" is repeated 1 time.
"mathematics" is repeated 1 time.
"not" is repeated 2 times.
"of" is repeated 1 time.
"reality" is repeated 2 times.
"refer" is repeated 2 times.
"the" is repeated 1 time.
"they" is repeated 3 times.
"to" is repeated 2 times.

isamert · Answer 4 · 2014-09-12T09:10:28.843

Here is a very bad example of doing this without using anything other than lists:

x = "As far as the laws of mathematics refer to reality they are not certain as far as they are certain they do not refer to reality"
words = x.split(" ")
words.sort()

words_copied = x.split(" ")
words_copied.sort()

for word in words:
    count = 0
    while True:
        try:
            index = words_copied.index(word)
            count += 1
            del words_copied[index]
        except ValueError:
            # no occurrences left; count stays 0 for words already reported
            if count != 0:
                print(word + " is repeated " + str(count) + " times.")
            break

EDIT: Here is a much better way:

x = "As far as the laws of mathematics refer to reality they are not certain as far as they are certain they do not refer to reality"
words = x.split(" ")
words.sort()

last_word = ""
for word in words:
    if word != last_word:
        count = [i for i, w in enumerate(words) if w == word]
        print(word + " is repeated " + str(len(count)) + " times.")
    last_word = word
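
The second version still rescans the whole list once per distinct word. Because the list is sorted, equal words sit next to each other, so a single pass that extends the current run is enough. A sketch under that assumption, still using only lists; the function name `count_sorted_runs` is made up for illustration.

```python
def count_sorted_runs(words):
    """Return (word, count) pairs in sorted order using only lists."""
    runs = []
    for word in sorted(words):
        if runs and runs[-1][0] == word:
            runs[-1][1] += 1         # same word as the previous one: extend the run
        else:
            runs.append([word, 1])   # a different word starts a new run
    return [(word, count) for word, count in runs]


x = "As far as the laws of mathematics refer to reality they are not certain as far as they are certain they do not refer to reality"
for word, count in count_sorted_runs(x.split()):
    print('"%s" is repeated %d time%s.' % (word, count, "s" if count > 1 else ""))
```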

Sam S. · Answer 5 · 2022-08-11T04:57:05.920

A solution based on a NumPy array, following the post "How do I count the occurrence of a certain item in an ndarray?":

import numpy as np

mysentence = """As far as the laws of mathematics refer to reality they are not certain as far as they are certain they do not refer to reality"""
words = np.array(mysentence.split(" "))
# np.unique returns the distinct words in sorted order plus their counts
unique_words, frq = np.unique(words, return_counts=True)

for word, count in zip(unique_words, frq):
    print(f'"{word}" is repeated {count} time{"s" if count > 1 else ""}.')

Output:

"As" is repeated 1 time.
"are" is repeated 2 times.
"as" is repeated 3 times.
"certain" is repeated 2 times.
"do" is repeated 1 time.
"far" is repeated 2 times.
"laws" is repeated 1 time.
"mathematics" is repeated 1 time.
"not" is repeated 2 times.
"of" is repeated 1 time.
"reality" is repeated 2 times.
"refer" is repeated 2 times.
"the" is repeated 1 time.
"they" is repeated 3 times.
"to" is repeated 2 times.

Sk337 · Answer 6 · 2023-01-19T11:52:48.107

0

If the string is a single word or phrase repeated back to back, such as "miamimiamimiamimiamimiamimiamimiamimiami" or "San FranciscoSan FranciscoSan FranciscoSan FranciscoSan FranciscoSan FranciscoSan FranciscoSan FranciscoSan Francisco", this finds the repeated unit:

import re

String="San FranciscoSan FranciscoSan FranciscoSan FranciscoSan FranciscoSan FranciscoSan FranciscoSan FranciscoSan Francisco"
word=""
for i in String:
    word+=i
    if String=="".join(re.findall(word,String)):
        print(word)  # the shortest prefix whose repeats rebuild the string
        break

edited Jan 19 '23 at 11:52

answered Jan 19 '23 at 11:52


This code will not work as expected, as the if statement will always evaluate to false. The re.findall(word,String) function returns a list of all non-overlapping matches of the regular expression word in the string String. – Viktor Liehr Jan 20 '23 at 19:59
It is working fine I already checked. It will return "San Francisco". Can you check it once. Also re.findall will return list of word which we give as variable "word". And word is keep changing as for loop goes and when word will become "San Francisco" findall will give all San Francisco and if we join all it will become our original string. – Sk337 Jan 21 '23 at 20:16
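
As an alternative sketch for the same repeated-unit problem, the unit can be found without regular expressions, which also avoids surprises when the text contains regex metacharacters such as `.` or `(`. The function name `repeated_unit` is made up for illustration, and the sketch assumes the goal is the shortest unit whose repetition rebuilds the whole string.

```python
def repeated_unit(s):
    """Return the shortest substring whose repetition rebuilds s."""
    # Searching for s inside s + s from index 1 finds the smallest shift
    # at which s lines up with itself; that shift is the unit length.
    period = (s + s).find(s, 1)
    return s[:period]


print(repeated_unit("San Francisco" * 9))   # San Francisco
print(repeated_unit("miami" * 8))           # miami
print(repeated_unit("abc"))                 # abc (no repetition, so the whole string)
```

If the regex approach is kept instead, wrapping the pattern as `re.escape(word)` makes it match the text literally.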
