Efficiently find repeated characters in a string

Question

I know that the efficiency of this code is not optimal (esp. with gigantic inputs), and I know that there is a way to change this algorithm to handle other data types and not just a repetition in a string (obviously there are only so many characters to search through).

Is there any way I can increase efficiency here?

I tried using a dictionary and the function kept returning 'none' so I tried a list and things worked out fine.

Thanks ahead of time to anyone who can help me out!

def find_repeater(string):
    my_list = []
    my_list.append(string[0])

    for i in range (1, len(string)):

        if string[i] in my_list:
            print 'repetition found'
            return (string[i])

        else:
            my_list.append(string[i])

print find_repeater('abca')

now with a dictionary....(it keeps printing 'none' to the console)

def find_repeater(string):
    my_dict = {}
    my_dict[0] = string[0]

    for i in range (1, len(string)):

        if string[i] in my_dict:
            print 'repetition found'
            return string[i]

        else:
            my_dict[i] = string[i]

print find_repeater('abca')

using a dictionary is the way to go. Show what you tried, and we'll help you fix it. — Barmar, Sep 07 '14 at 00:54
@PadraicCunningham: looks like OP wants to return the letter that is repeated first — inspectorG4dget, Sep 07 '14 at 00:56
If you have working code, which just is not efficient enough, a code-review might be a good course (not yet the case). But anyway, consider using a proper title describing the problem you want help with, instead of just a cry for help. Though you are to be commended for only wanting a slight poke in the right direction instead of the whole solution. — Deduplicator, Sep 07 '14 at 00:59
If this code works, this question would be more appropriate on http://codereview.stackexchange.com/. — skrrgwasme, Sep 07 '14 at 01:00
you are creating a range for no reason, just iterate over the string, also `my_list = [string[0]]` will do what you want in one assignment — Padraic Cunningham, Sep 07 '14 at 01:04
Sorry everyone, it was my first post. Thanks for the support. Ill be more specific next time. — ChipSkylark, Sep 07 '14 at 01:05
@emodendroket: sorting costs O(nlogn) time. Not sure that's very efficient. Most of the bottleneck come from (I reckon) the `in my_list` lookups — inspectorG4dget, Sep 07 '14 at 01:06
@inspectorG4dget Briefly looking it seems like this runs in quadratic time... anyway the point of sorting is that then all the lookups are much faster. That may very well not be the best way to do it though. — Casey, Sep 07 '14 at 01:10

georg · Accepted Answer · 2014-09-19T08:41:20.313

As this is a performance question, let's do some timings:

def test_set(xs):
    seen = set()  # O(1) lookups
    for x in xs:
        if x not in seen:
            seen.add(x)
        else:
            return x

import collections

def test_counter(xs):
    freq = collections.Counter(xs)
    for k in freq:
        if freq[k] > 1:
            return k

def test_dict(xs):
    d = {}
    for x in xs:
        if x in d:
            return x
        d[x] = 1

def test_sort(xs):
    ys = sorted(xs)

    for n in range(1, len(xs)):
        if ys[n] == ys[n-1]:
            return ys[n]

##

import sys, timeit
print (sys.version + "\n")
xs = list(range(10000)) + [999]
fns = [p for name, p in globals().items() if name.startswith('test')]
for fn in fns:
    assert fn(xs) == 999
    print ('%50s %.5f' % (fn, timeit.timeit(lambda: fn(xs), number=100)))

I'm testing on an list of integers rather than a string (because with a string you can't get more than 256 loops). The results on my machine look like this:

3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:25:50) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]

                <function test_set at 0x1020f7380> 0.19265
               <function test_dict at 0x1020f7490> 0.12725
               <function test_sort at 0x1020f7518> 0.04683
            <function test_counter at 0x1020f7408> 0.92485

So the sort method appears to be the winner. I guess this is because it doesn't waste time creating hashes and allocating dict/set structures. Also, if you don't care about the source list being changed, you can do xs.sort() instead of ys = sorted(xs), which gives you zero memory footprint.

On the other side, if repeated items are more probable to occur towards the beginning of the input (as in xs = 'abcdef' * 10000), the set method will perform the best, as it, unlike sort or Counter, returns immediately once a repeat is found and doesn't need to preprocess the whole list. You should also use set if you need the first repeating element, not just one of them.

Counter is a nice tool, but it's not designed for performance, so if you really have to deal with "gigantic inputs", go with sets (if they fit in memory) or mergesort if they don't.

Wouldn't sort be at an advantage in this example because you already have a list that's sorted for all but one value? Putting random.shuffle(xs) after setting xs makes sort the worst performer, or about tied with counter. — Ryan McGrath, Aug 18 '20 at 23:18

Nir Alfasi · Answer 2 · 2014-09-07T01:18:27.037

4

You can use collections to find repeating characters:

import collections

freq = collections.Counter("abcda")
for k in freq:
    if freq[k] > 1:
        print k # prints "a"
        break

If you want only to find if there are repetitions (without finding the repeated characters):

letters = list("abcda")
no_rep = set(letters)
print len(letters) > len(no_rep) # prints 'True' when there are repeating characters

edited Sep 07 '14 at 01:18

answered Sep 07 '14 at 01:02

Nir Alfasi

53,191
11
86
129

don't think you need to cast the set to a list or make letters a list,`letters = ("abcda")` `len(letters) > len(set(letters))` – Padraic Cunningham Sep 07 '14 at 01:14
@PadraicCunningham right! (thx - fixed), I forgot to remove it (was playing with other options before posting it). – Nir Alfasi Sep 07 '14 at 01:19
1

no worries, +1 was, going to post a Counter answer too, just got too lazy :) – Padraic Cunningham Sep 07 '14 at 01:22

score 2 · Answer 3 · answered Sep 07 '14 at 01:00

2

def find_repeater(mystr):
    seen = set()  # O(1) lookups
    for char in mystr:
        if char not in seen:
            seen.add(char)
        else:
            print('repetition found')
            return char

answered Sep 07 '14 at 01:00

inspectorG4dget

110,290
27
149
241

OP asked for a dictionary. Should have given both answers, like I did – hd1 Sep 07 '14 at 01:00

score 1 · Answer 4 · answered Sep 07 '14 at 01:00

Here you go, using a dictionary:

def find_repeater(string):
    my_list = {}
    my_list[string[0]] = 1
    for i in range (1, len(string)):
        if string[i] in my_list.keys():
            print 'repetition found'
            return (string[i])
        else:
            my_list[string[i]] = 1
print find_repeater('abca')

Personally, though, I would use a set here:

def find_repeater(string):
    my_list = set()
    my_list.add(string[0])
    for i in range (1, len(string)):
        if string[i] in my_list:
            print 'repetition found'
            return (string[i])
        else:
            my_list.add(string[i])
print find_repeater('abca')

why use range instead of just iterating over the string? – Padraic Cunningham Sep 07 '14 at 01:03 — Padraic Cunningham, Sep 07 '14 at 01:03

score 0 · Answer 5 · answered Sep 07 '14 at 01:42

0

This should work.

def return_dupe(string):
    d = {}
    [d.update({k:2}) if k in d else d.update({k:1}) for k in string]
    return [key for key in d if d[key] > 1]

answered Sep 07 '14 at 01:42

amehta

1,307
3
18
22

score 0 · Answer 6 · answered May 18 '17 at 05:04

Another approach can be as simple as:

def repeting_count(text):

    working_text = str(text.lower())

    dup = []

    for char in working_text:
        if working_text.count(char) > 1:
            dup.append(char)

        else:
            pass

    for num in dup:
        while dup.count(num) > 1:
            dup.remove(num)

score 0 · Answer 7 · answered Jun 25 '18 at 00:28

def rec_char(s):
    buf={}
    for c in s:
        try:
            print(buf[c],end=", ")
        except:
            buf[c]=c


rec_char("ABCAB")

This function in python3 with time complexity O(n) will print the characters that appear more than once.

Efficiently find repeated characters in a string

7 Answers7

Linked