Find "one letter that appears twice" in a string

Question

I'm trying to catch a letter that appears twice in a string using RegEx (or maybe there's a better way?). For example, my string is:

ugknbfddgicrmopn

The output would be:

dd

However, I've tried something like:

re.findall('[a-z]{2}', 'ugknbfddgicrmopn')

but in this case, it returns:

['ug', 'kn', 'bf', 'dd', 'gi', 'cr', 'mo', 'pn']   # the expected output is `['dd']`

I also have a way to get the expected output:

>>> l = []
>>> tmp = None
>>> for i in 'ugknbfddgicrmopn':
...     if tmp != i:
...         tmp = i
...         continue
...     l.append(i*2)
...     
... 
>>> l
['dd']
>>>

But that's too complex...

If it's 'abbbcppq', then only catch:

abbbcppq
 ^^  ^^

So the output is:

['bb', 'pp']

Then, if it's 'abbbbcppq', catch bb twice:

abbbbcppq
 ^^^^ ^^

So the output is:

['bb', 'bb', 'pp']

You can use a backreference, [`([a-z])\1`](https://regex101.com/r/wT7cA9/1) – Tushar Dec 14 '15 at 07:01
You seem to expect -- but don't mention -- contiguity, and you don't explain what you'd want as a result if `"ddd"` were present. – DSM Dec 14 '15 at 07:05
@Tushar what if he wants to find those which appear exactly twice? Like fetching `dd` from `fddf` but not from `fdddf`. – Avinash Raj Dec 14 '15 at 07:08
@KevinGuan you need to use findall to get more than one occurrence. `re.search('([a-z])\1', 'ugknbfddgicrmopn').group()` – Avinash Raj Dec 14 '15 at 07:11
@AvinashRaj: Huh? Tested on Python 2.7 and Python 3.5, both raise `AttributeError: 'NoneType' object has no attribute 'group'`. – Remi Guan Dec 14 '15 at 07:12
@KevinGuan what would be your expected output if the input is `abbbbcppq`? – Avinash Raj Dec 14 '15 at 07:12
@AvinashRaj: Sure, catch `bb` twice, so it's `['bb', 'bb', 'pp']`. – Remi Guan Dec 14 '15 at 07:13
Do you mean the letter appears twice **together** or **anywhere in the input**? – Dima Tisnek Dec 14 '15 at 11:05
Quite similar: [\[python\]: use re to find consecutively repeated chars](http://stackoverflow.com/q/7147796) – Bhargav Rao Dec 14 '15 at 11:54

Avinash Raj · Accepted Answer · 2015-12-14T09:06:19.543

51

You need to use a regex based on a capturing group, and define your regex as a raw string.

>>> re.search(r'([a-z])\1', 'ugknbfddgicrmopn').group()
'dd'
>>> [i+i for i in re.findall(r'([a-z])\1', 'abbbbcppq')]
['bb', 'bb', 'pp']

or

>>> [i[0] for i in re.findall(r'(([a-z])\2)', 'abbbbcppq')]
['bb', 'bb', 'pp']

Note that re.findall here returns a list of tuples, with the text matched by the first group as the first element and the text matched by the second group as the second element. In our case the text matched by the first group is enough, so I used i[0].
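To make the point about tuples concrete, here is a small sketch of what re.findall returns for each of the two patterns above, next to re.finditer, which gives the whole match directly. The variable names are only illustrative.

```python
import re

text = 'abbbbcppq'

# One capturing group: findall returns only the captured letter per match.
one_group = re.findall(r'([a-z])\1', text)

# Two capturing groups: findall returns a (group 1, group 2) tuple per match.
two_groups = re.findall(r'(([a-z])\2)', text)

# finditer yields match objects, so group(0) is the whole matched pair.
whole_matches = [m.group(0) for m in re.finditer(r'([a-z])\1', text)]

print(one_group)      # ['b', 'b', 'p']
print(two_groups)     # [('bb', 'b'), ('bb', 'b'), ('pp', 'p')]
print(whole_matches)  # ['bb', 'bb', 'pp']
```

With finditer there is no need to rebuild the pair with i+i or to pick i[0] out of a tuple.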

edited Dec 14 '15 at 09:06

answered Dec 14 '15 at 07:13

Avinash Raj

Ohhhhh, after re-reading the answer I understand how it works now. So `([a-z])` catches the first letter, and `\1` repeats it. :) – Remi Guan Dec 14 '15 at 08:06
3

@KevinGuan ya, exactly.. `()` is called a capturing group. So `([a-z])` captures the first letter and the following `\1` is a back-reference to the first capturing group. So `\1` refers to the characters which were matched by the first group. – Avinash Raj Dec 14 '15 at 08:51

Mazdak · Answer 2 · 2015-12-15T06:24:29.393

32

A Pythonic way is to use the zip function within a list comprehension:

>>> s = 'abbbcppq'
>>>
>>> [i+j for i,j in zip(s,s[1:]) if i==j]
['bb', 'bb', 'pp']

If you are dealing with a large string, you can use the iter() function to convert the string to an iterator and itertools.tee() to create two independent iterators. Then call next() on the second iterator to consume its first item, and pass both iterators to zip (in Python 2.X use itertools.izip(), which returns an iterator).

>>> from itertools import tee
>>> first = iter(s)
>>> second, first = tee(first)
>>> next(second)
'a'
>>> [i+j for i,j in zip(first,second) if i==j]
['bb', 'bb', 'pp']
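As a side note, the tee-and-advance steps above are what itertools.pairwise does in Python 3.10 and later. The following is a minimal sketch, not part of the original answer; the helper name adjacent_pairs is made up for illustration, and the fallback branch follows the tee-based recipe for older interpreters.

```python
try:
    from itertools import pairwise  # Python 3.10+
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        # Fallback for older versions: two iterators, the second one step ahead.
        first, second = tee(iterable)
        next(second, None)
        return zip(first, second)


def adjacent_pairs(s):
    """Return every overlapping pair of equal neighbouring characters."""
    return [a + b for a, b in pairwise(s) if a == b]


print(adjacent_pairs('abbbcppq'))  # ['bb', 'bb', 'pp']
```

Like the zip version, this counts overlapping pairs, so a run of three b's produces 'bb' twice.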

Benchmark with `RegEx` recipe:

# ZIP
~ $ python -m timeit --setup "s='abbbcppq'" "[i+j for i,j in zip(s,s[1:]) if i==j]"
1000000 loops, best of 3: 1.56 usec per loop

# REGEX
~ $ python -m timeit --setup "s='abbbcppq';import re" "[i[0] for i in re.findall(r'(([a-z])\2)', 'abbbbcppq')]"
100000 loops, best of 3: 3.21 usec per loop
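Note that the two commands above do not time the same input: the zip version runs on s='abbbcppq' while the regex version runs on the literal 'abbbbcppq'. Below is a rough sketch for repeating the comparison on one string with the timeit module. The function names and the precompiled pattern are choices made for this sketch, and the timings will vary by machine, so none are shown.

```python
import re
import timeit

s = 'abbbbcppq'
pair_re = re.compile(r'([a-z])\1')  # precompiled once; an assumption of this sketch


def with_zip():
    # Overlapping pairs of equal neighbours.
    return [i + j for i, j in zip(s, s[1:]) if i == j]


def with_regex():
    # Non-overlapping pairs, as matched by the backreference pattern.
    return [m.group(0) for m in pair_re.finditer(s)]


zip_time = timeit.timeit(with_zip, number=10_000)
regex_time = timeit.timeit(with_regex, number=10_000)
print(f'zip:   {zip_time:.4f} s for 10,000 runs')
print(f'regex: {regex_time:.4f} s for 10,000 runs')
```

Keep in mind that the two functions are not equivalent: on 'abbbbcppq' the zip version returns ['bb', 'bb', 'bb', 'pp'] and the regex version returns ['bb', 'bb', 'pp'], so a speed comparison alone does not settle which one to use.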

After your last edit, as mentioned in the comments: if you want to match only one pair of b's in strings like "abbbcppq", you can use finditer(), which returns an iterator of match objects, and extract each result with the group() method:

>>> import re
>>> 
>>> s = "abbbcppq"
>>> [item.group(0) for item in re.finditer(r'([a-z])\1',s,re.I)]
['bb', 'pp']

Note that re.I is the IGNORECASE flag which makes the RegEx match the uppercase letters too.
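A small sketch of what the flag changes, using a made-up mixed-case string. With re.I the backreference is compared case-insensitively as well, so a pair like 'Dd' also counts as a repeat.

```python
import re

s = 'aBBcDdeF'

# Without the flag, [a-z] only matches lowercase letters, so nothing is found here.
case_sensitive = [m.group(0) for m in re.finditer(r'([a-z])\1', s)]

# With re.I both the character class and the backreference ignore case.
case_insensitive = [m.group(0) for m in re.finditer(r'([a-z])\1', s, re.I)]

print(case_sensitive)    # []
print(case_insensitive)  # ['BB', 'Dd']
```

If mixed-case pairs should not count, a pattern such as ([a-zA-Z])\1 without the flag is one alternative.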

edited Dec 15 '15 at 06:24

answered Dec 14 '15 at 07:03

Mazdak

Well, as in my edit, I want `bb` from `abbbc`. Okay, I know this is a shortened version of another example of mine, and that example's output doesn't match what my edit expects... sorry about that... – Remi Guan Dec 14 '15 at 07:11
@KevinGuan In that case you need a set comprehension. – Mazdak Dec 14 '15 at 07:13
Well, if use `set`, then it can't catch `bb` twice as I said in comments of my question. – Remi Guan Dec 14 '15 at 07:16
Ah, yeah. This is good :). However, I'd like to use regex in this case. Good to know there's *another usage* of `zip()` :D – Remi Guan Dec 14 '15 at 07:20
@KevinGuan Yes, but using `regex` is not Pythonic at all; check out the benchmark result. – Mazdak Dec 14 '15 at 11:45
3

@Kasramvd: Performance isn't everything, and is most likely irrelevant in this case. Regexps are a tool to solve a problem, and they solve this problem clearly and concisely. – Vincent Savard Dec 14 '15 at 13:29
@VincentSavard Yep, performance is not everything, but when? Actually, the most important points about code are performance (in terms of memory use and run time) and then readability (coding style, amount of code, etc.). And as you can clearly see, the first approach is more efficient and more readable than the regex recipe. As for the second one, which is not very complicated, the main point is that it's very optimized in terms of memory use, which is much better and more usable when dealing with large data sets. – Mazdak Dec 14 '15 at 14:51
2

Performance is important when you profiled your program and determined which parts were bottlenecks. There's absolutely no reason to be concerned about performance in this context. Thus, readability should be the main factor, and this is completely subjective in this case. – Vincent Savard Dec 14 '15 at 14:56
@VincentSavard Yes, and that's what the first part does. – Mazdak Dec 14 '15 at 15:08
You cannot make this claim as there is no context to determine if this is a bottleneck. – Vincent Savard Dec 14 '15 at 15:09
@VincentSavard I'm not talking about bottlenecks. I just suggested a Pythonic way, that's all, and in the rest of my answer I suggested other approaches for other situations, which might be useful for the OP and future readers. – Mazdak Dec 14 '15 at 15:12
2

that. because there is no reason to use a regex here. – njzk2 Dec 14 '15 at 16:37
2

@njzk2: One pretty good reason is that this does not do what OP wants for the string `abbbc` (e.g. `['bb']` for the regexp vs `['bb', 'bb']` for this code). – Vincent Savard Dec 14 '15 at 16:53
@VincentSavard The OP added the last part after my answer, which made me remove my regex approach that did exactly that job (before the accepted answer). Actually, he edited the code multiple times. Anyway, I will add another approach with regex. – Mazdak Dec 14 '15 at 17:03
@VincentSavard Thanks for your attention, and reminding! – Mazdak Dec 14 '15 at 17:09

score 9 · Answer 3 · edited May 23 '17 at 10:29

Using back reference, it is very easy:

import re
# Plain raw string: the ur'' prefix is Python 2 only and is a syntax error in Python 3.
p = re.compile(r'([a-z])\1{1,}')
p.findall("ugknbfddgicrmopn")
#output: ['d']
p.findall("abbbcppq")
#output: ['b', 'p']

For more details, you can refer to a similar question about Perl: Regular expression to match any character being repeated more than 10 times
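Because findall returns only the captured group, the pattern above reports the repeated letter rather than the repeated text. If the whole run is wanted, a sketch along these lines works; the helper name repeated_runs is hypothetical, and the pattern uses the shorter \1+ form of \1{1,}.

```python
import re

# A letter followed by one or more copies of itself.
run_re = re.compile(r'([a-z])\1+')


def repeated_runs(s):
    """Return each maximal run of a repeated letter as a whole string."""
    return [m.group(0) for m in run_re.finditer(s)]


print(repeated_runs('ugknbfddgicrmopn'))  # ['dd']
print(repeated_runs('abbbcppq'))          # ['bbb', 'pp']
```

This collapses each run into one result, so it does not produce the ['bb', 'bb', 'pp'] that the question asks for with 'abbbbcppq'.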

edited May 23 '17 at 10:29

Community

answered Dec 14 '15 at 07:08

Gurupad Hegde

9

`\1{1,}` would be written as `\1+` – Avinash Raj Dec 14 '15 at 08:52

score 5 · Answer 4 · answered Dec 14 '15 at 11:04

It is pretty easy without regular expressions:

In [3]: import collections

In [4]: [k for k, v in collections.Counter("abracadabra").items() if v==2]
Out[4]: ['b', 'r']

answered Dec 14 '15 at 11:04

Dima Tisnek

Hmm...doesn't work if the input is `abbbbcppq`. Maybe the problem is the `if v == 2` :) – Remi Guan Dec 14 '15 at 11:11
Your question is somewhat ambiguous: are we looking for all letters that appear more than once or only those that appear exactly twice in the whole input? This answer is accurate for the latter, but for the former `[k for k, v in collections.Counter("abbbbcppq").items() if v>1]` will do. – MartyMacGyver Dec 22 '15 at 18:31
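The two readings discussed in these comments can be put side by side in a short sketch. The function name letters_by_count is made up for illustration. Note that both readings count occurrences anywhere in the string and do not check whether the letters are adjacent.

```python
from collections import Counter


def letters_by_count(s):
    """Return (letters seen exactly twice, letters seen more than once)."""
    counts = Counter(s)
    exactly_twice = [k for k, v in counts.items() if v == 2]
    more_than_once = [k for k, v in counts.items() if v > 1]
    return exactly_twice, more_than_once


print(letters_by_count('abbbbcppq'))  # (['p'], ['b', 'p'])
```

For 'abbbbcppq' neither list matches the ['bb', 'bb', 'pp'] requested in the question, which is why the adjacency-based answers fit that requirement better.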

xhg · Answer 5 · 2015-12-14T07:18:48.627

4

Maybe you can use a generator to achieve this:

def adj(s):
    last_c = None
    for c in s:
        if c == last_c:
            yield c * 2
        last_c = c

s = 'ugknbfddgicrmopn'
v = [x for x in adj(s)]
print(v)
# output: ['dd']
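The generator above yields overlapping pairs, so 'abbbcppq' gives 'bb' twice. If the non-overlapping behaviour from the question is wanted ('abbbcppq' gives ['bb', 'pp'] and 'abbbbcppq' gives ['bb', 'bb', 'pp']), one possible variant groups runs with itertools.groupby. This is a sketch of a different technique, not the answer's own method, and the name adjacent_pairs is hypothetical.

```python
from itertools import groupby


def adjacent_pairs(s):
    """Yield non-overlapping pairs of equal neighbouring characters."""
    for char, run in groupby(s):
        run_length = sum(1 for _ in run)
        # A run of n equal characters holds n // 2 non-overlapping pairs.
        for _ in range(run_length // 2):
            yield char * 2


print(list(adjacent_pairs('abbbcppq')))   # ['bb', 'pp']
print(list(adjacent_pairs('abbbbcppq')))  # ['bb', 'bb', 'pp']
```

These results agree with what the backreference regex returns for the same inputs.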

edited Dec 14 '15 at 07:18

answered Dec 14 '15 at 07:14

xhg

score 3 · Answer 6 · answered Dec 14 '15 at 07:48

"or maybe there's some better ways"

Since regex is often misunderstood by the next developer to encounter your code (who may even be you), and since simpler != shorter,

How about the following pseudo-code:

function findMultipleLetters(inputString) {        
    foreach (letter in inputString) {
        dictionaryOfLettersOccurrance[letter]++;
        if (dictionaryOfLettersOccurrance[letter] == 2) {
            multipleLetters.add(letter);
        }
    }
    return multipleLetters;
}
multipleLetters = findMultipleLetters("ugknbfddgicrmopn");
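A direct Python rendering of the pseudo-code might look like the sketch below. The snake_case names are my own translation of the pseudo-code identifiers. As written, the logic records a letter the second time it is seen anywhere in the string, not only when the two occurrences are adjacent.

```python
from collections import defaultdict


def find_multiple_letters(input_string):
    """Return letters in the order in which they are seen for the second time."""
    occurrences = defaultdict(int)
    multiple_letters = []
    for letter in input_string:
        occurrences[letter] += 1
        # Record the letter once, at the moment its count reaches two.
        if occurrences[letter] == 2:
            multiple_letters.append(letter)
    return multiple_letters


print(find_multiple_letters('ugknbfddgicrmopn'))  # ['d', 'g', 'n']
```

The result includes 'g' and 'n' because they occur twice in the string even though the occurrences are far apart, so this answers the "anywhere in the input" reading of the question.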

score 2 · Answer 7 · edited Dec 14 '15 at 08:56

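A1 = "abcdededdssffffccfxx"

# Print each run of repeated letters once, as a pair.
for i in range(len(A1) - 1):
    # A pair starts at i when A1[i] == A1[i+1] and A1[i] does not continue an earlier run.
    # The i == 0 check stops A1[i-1] from wrapping around to the last character.
    if A1[i + 1] == A1[i] and (i == 0 or A1[i] != A1[i - 1]):
        print(A1[i] * 2)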

edited Dec 14 '15 at 08:56

Remi Guan

answered Dec 14 '15 at 07:17

Mark White

7

Welcome to SO! When answering, also add explanation of the code. – Tushar Dec 14 '15 at 07:18
In this case if I have `ffff`, then the output would be `['dd', 'ss', 'ff', 'ff', 'ff']`. – Remi Guan Dec 14 '15 at 07:23
Actually...this still doesn't catch `'ff', 'ff'` as I said in comments. – Remi Guan Dec 14 '15 at 08:57

score 0 · Answer 8 · answered Dec 14 '15 at 07:13

>>> l = ['ug', 'kn', 'bf', 'dd', 'gi', 'cr', 'mo', 'pn']
>>> import re
>>> newList = [item for item in l if re.search(r"([a-z]{1})\1", item)]
>>> newList
['dd']

answered Dec 14 '15 at 07:13

Mayur Koshti

What is the use if you give a list of items? This will not work for other strings. – Rohan Amrute Dec 14 '15 at 07:25
I have used `re.search`, which works only on strings. – Mayur Koshti Dec 14 '15 at 07:32
Also, it works for other strings. For example, if you add the item 'zz' to the list, it will give both 'dd' and 'zz'. – Mayur Koshti Dec 14 '15 at 07:33
What I am saying is that you provided a predefined list, so it will match from the list, and all the list items you gave have length 2. So your program is not flexible. Given a string, it will not give the required output. I am just saying that the input is in the form of a `String`, not a `list`. – Rohan Amrute Dec 14 '15 at 07:35
