58

I'm trying to catch a letter that appears twice in a string using RegEx (or maybe there's a better way?). For example, my string is:

ugknbfddgicrmopn

The output would be:

dd

However, I've tried something like:

re.findall('[a-z]{2}', 'ugknbfddgicrmopn')

but in this case, it returns:

['ug', 'kn', 'bf', 'dd', 'gi', 'cr', 'mo', 'pn']   # the expected output is `['dd']`

I also have a way to get the expected output:

>>> l = []
>>> tmp = None
>>> for i in 'ugknbfddgicrmopn':
...     if tmp != i:
...         tmp = i
...         continue
...     l.append(i*2)
...
>>> l
['dd']

But that's too complex...


If it's 'abbbcppq', then only catch:

abbbcppq
 ^^  ^^

So the output is:

['bb', 'pp']

Then, if it's 'abbbbcppq', catch bb twice:

abbbbcppq
 ^^^^ ^^

So the output is:

['bb', 'bb', 'pp']
Mazdak
Remi Guan

8 Answers

51

You need to use a regex with a capturing group, and define the pattern as a raw string.

>>> re.search(r'([a-z])\1', 'ugknbfddgicrmopn').group()
'dd'
>>> [i+i for i in re.findall(r'([a-z])\1', 'abbbbcppq')]
['bb', 'bb', 'pp']

or

>>> [i[0] for i in re.findall(r'(([a-z])\2)', 'abbbbcppq')]
['bb', 'bb', 'pp']

Note that re.findall here returns a list of tuples, with the text matched by the first group as the first element and the text matched by the second group as the second element. In our case the text matched by the first group is enough, hence i[0].
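As a rough illustration of how the number of capturing groups changes what re.findall returns, here is a small sketch. The variable names are only for illustration and are not part of the answer above.

```python
import re

s = 'abbbbcppq'

# No capturing group: findall returns the full matches.
no_groups = re.findall(r'[a-z]{2}', s)

# One capturing group: findall returns only what the group captured.
one_group = re.findall(r'([a-z])\1', s)

# Two capturing groups: findall returns one tuple per match,
# (text of group 1, text of group 2).
two_groups = re.findall(r'(([a-z])\2)', s)

# finditer sidesteps the group question: group(0) is always the whole match.
whole_matches = [m.group(0) for m in re.finditer(r'([a-z])\1', s)]

print(no_groups)      # ['ab', 'bb', 'bc', 'pp']
print(one_group)      # ['b', 'b', 'p']
print(two_groups)     # [('bb', 'b'), ('bb', 'b'), ('pp', 'p')]
print(whole_matches)  # ['bb', 'bb', 'pp']
```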

Avinash Raj
  • Ohhhhh, after re-reading the answer I understand how it works now. So `([a-z])` catches the first letter, and `\1` repeats it. :) – Remi Guan Dec 14 '15 at 08:06
  • 3
    @KevinGuan Yes, exactly. `()` is called a capturing group. So `([a-z])` captures the first letter and the following `\1` is a back-reference to the first capturing group. So `\1` refers to whatever text was matched by the first group. – Avinash Raj Dec 14 '15 at 08:51
32

As a Pythonic alternative, you can use the zip function within a list comprehension:

>>> s = 'abbbcppq'
>>>
>>> [i+j for i,j in zip(s,s[1:]) if i==j]
['bb', 'bb', 'pp']

If you are dealing with a large string, you can avoid the copy made by s[1:]: use iter() to convert the string to an iterator and itertools.tee() to create two independent iterators from it, call next() on the second iterator to consume its first item, and then pass both iterators to zip (in Python 2.X use itertools.izip(), which returns an iterator).

>>> from itertools import tee
>>> first = iter(s)
>>> second, first = tee(first)
>>> next(second)
'a'
>>> [i+j for i,j in zip(first,second) if i==j]
['bb', 'bb', 'pp']
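The tee() steps above can be wrapped in a small generator so they work on any iterable of characters, not just a string held in memory. This is a sketch; the function name adjacent_pairs is hypothetical.

```python
from itertools import tee

def adjacent_pairs(iterable):
    """Yield a + b for every pair of equal neighbouring items."""
    first, second = tee(iterable)
    # Advance the second iterator by one item; the default avoids
    # StopIteration on empty input.
    next(second, None)
    for a, b in zip(first, second):
        if a == b:
            yield a + b

print(list(adjacent_pairs('abbbcppq')))   # ['bb', 'bb', 'pp']
print(list(adjacent_pairs('abbbbcppq')))  # ['bb', 'bb', 'bb', 'pp']
```

Note that the pairs overlap: a run of four b's contains three neighbouring pairs, so 'bb' is reported three times.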

Benchmark with RegEx recipe:

# ZIP
~ $ python -m timeit --setup "s='abbbcppq'" "[i+j for i,j in zip(s,s[1:]) if i==j]"
1000000 loops, best of 3: 1.56 usec per loop

# REGEX
~ $ python -m timeit --setup "s='abbbcppq';import re" "[i[0] for i in re.findall(r'(([a-z])\2)', 'abbbbcppq')]"
100000 loops, best of 3: 3.21 usec per loop

After your last edit, as mentioned in the comments: if you want to match only one pair of b in strings like "abbbcppq", you can use finditer(), which returns an iterator of match objects, and extract each result with the group() method:

>>> import re
>>> 
>>> s = "abbbcppq"
>>> [item.group(0) for item in re.finditer(r'([a-z])\1',s,re.I)]
['bb', 'pp']

Note that re.I is the IGNORECASE flag which makes the RegEx match the uppercase letters too.
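The practical difference between the zip() recipe and the finditer() recipe is whether pairs may overlap. A minimal side-by-side sketch, where the helper names are made up for illustration:

```python
import re

def overlapping_pairs(s):
    # zip() compares every character with its right-hand neighbour,
    # so a run of three equal letters yields two pairs.
    return [a + b for a, b in zip(s, s[1:]) if a == b]

def non_overlapping_pairs(s):
    # The regex engine resumes scanning after each match,
    # so matched characters are never reused.
    return [m.group(0) for m in re.finditer(r'([a-z])\1', s)]

print(overlapping_pairs('abbbcppq'))       # ['bb', 'bb', 'pp']
print(non_overlapping_pairs('abbbcppq'))   # ['bb', 'pp']
print(non_overlapping_pairs('abbbbcppq'))  # ['bb', 'bb', 'pp']
```

The non-overlapping results are the ones that match the outputs requested in the question.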

Mazdak
  • Well, per my edit, I want `bb` from `abbbc`. I know this is a shortened version of another example of mine, and that example's output doesn't match what my edit expects... sorry about that... – Remi Guan Dec 14 '15 at 07:11
  • @KevinGuan In that case you need a set comprehension. – Mazdak Dec 14 '15 at 07:13
  • Well, if use `set`, then it can't catch `bb` twice as I said in comments of my question. – Remi Guan Dec 14 '15 at 07:16
  • Ah, yeah. This is good :). But however, I'd like use regex in this case. Good to know there's *another usage* of `zip()` :D – Remi Guan Dec 14 '15 at 07:20
  • @KevinGuan Yes, but using `regex` is not Pythonic at all; check out the benchmark result. – Mazdak Dec 14 '15 at 11:45
  • 3
    @Kasramvd: Performance isn't everything, and is most likely irrelevant in this case. Regexps are a tool to solve a problem, and they solve this problem clearly and concisely. – Vincent Savard Dec 14 '15 at 13:29
  • @VincentSavard Yep, performance is not everything, but when? Actually the most important points about code are performance (in terms of memory use and run time) and then readability (coding style, amount of code, etc.). And as you can clearly see, the first approach is both faster and more readable than the regex recipe. As for the second one, which is not very complicated, the main point is that it is very economical in terms of memory use, which makes it much more usable when dealing with large data sets. – Mazdak Dec 14 '15 at 14:51
  • 2
    Performance is important when you profiled your program and determined which parts were bottlenecks. There's absolutely no reason to be concerned about performance in this context. Thus, readability should be the main factor, and this is completely subjective in this case. – Vincent Savard Dec 14 '15 at 14:56
  • @VincentSavard Yes, and that's what the first part does. – Mazdak Dec 14 '15 at 15:08
  • You cannot make this claim as there is no context to determine if this is a bottleneck. – Vincent Savard Dec 14 '15 at 15:09
  • @VincentSavard I'm not talking about bottlenecks. I just suggested a Pythonic way, that's all, and in the rest of my answer I suggested other approaches for other situations, which might be useful for the OP and future readers. – Mazdak Dec 14 '15 at 15:12
  • 2
    that. because there is no reason to use a regex here. – njzk2 Dec 14 '15 at 16:37
  • 2
    @njzk2: One pretty good reason is that this does not do what OP wants for the string `abbbc` (e.g. `['bb']` for the regexp vs `['bb', 'bb']` for this code). – Vincent Savard Dec 14 '15 at 16:53
  • @VincentSavard The OP added the last part after my answer, which made me remove my regex approach that did exactly that job (before the accepted answer). Actually he edited the code multiple times. Anyway, I will add another approach with regex. – Mazdak Dec 14 '15 at 17:03
  • @VincentSavard Thanks for your attention, and reminding! – Mazdak Dec 14 '15 at 17:09
9

Using back reference, it is very easy:

import re
p = re.compile(r'([a-z])\1{1,}')  # raw string; the ur'' prefix only exists in Python 2
re.findall(p, "ugknbfddgicrmopn")
# output: ['d']
re.findall(p, "abbbcppq")
# output: ['b', 'p']

For more details, you can refer to a similar question about Perl: Regular expression to match any character being repeated more than 10 times
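Because the pattern contains a capturing group, findall reports only the repeated letter, not the run itself. If the whole run is wanted, one option is to read group(0) from finditer. This is a sketch, assuming `\1+` as an equivalent spelling of `\1{1,}`:

```python
import re

# One letter followed by one or more copies of itself.
run_pattern = re.compile(r'([a-z])\1+')

# findall: only the captured letter of each run.
letters = run_pattern.findall('abbbcppq')

# finditer + group(0): the full run of repeated letters.
runs = [m.group(0) for m in run_pattern.finditer('abbbcppq')]

print(letters)  # ['b', 'p']
print(runs)     # ['bbb', 'pp']
```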

Gurupad Hegde
5

It is pretty easy without regular expressions:

In [3]: import collections

In [4]: [k for k, v in collections.Counter("abracadabra").items() if v==2]
Out[4]: ['b', 'r']
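Keep in mind that Counter counts occurrences anywhere in the string, which is not the same as two equal letters sitting next to each other. A short sketch of the difference, with variable names chosen only for illustration:

```python
import re
from collections import Counter

s = 'abracadabra'

# Letters that occur exactly twice anywhere in the string.
twice_anywhere = [k for k, v in Counter(s).items() if v == 2]

# Letters that occur twice in a row: none in this string.
twice_in_a_row = re.findall(r'([a-z])\1', s)

print(twice_anywhere)  # ['b', 'r']
print(twice_in_a_row)  # []
```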
Dima Tisnek
  • Hmm...doesn't work if the input was `abbbbcppq`. Maybe the problem because that `if v == 2` :) – Remi Guan Dec 14 '15 at 11:11
  • Your question is somewhat ambiguous: are we looking for all letters that appear more than once or only those that appear exactly twice in the whole input? This answer is accurate for the latter, but for the former `[k for k, v in collections.Counter("abbbbcppq").items() if v>1]` will do. – MartyMacGyver Dec 22 '15 at 18:31
4

Maybe you can use a generator to achieve this:

def adj(s):
    last_c = None
    for c in s:
        if c == last_c:
            yield c * 2
        last_c = c

s = 'ugknbfddgicrmopn'
v = [x for x in adj(s)]
print(v)
# output: ['dd']
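For a run such as 'bbbb', the generator above yields 'bb' three times, because every letter is compared with its predecessor. If pairs should not overlap (so that 'abbbbcppq' gives 'bb' twice, as in the question), one possible variant resets the remembered letter after each hit. This is a sketch, and the name adj_non_overlapping is hypothetical:

```python
def adj_non_overlapping(s):
    last_c = None
    for c in s:
        if c == last_c:
            yield c * 2
            # Forget the letter so it cannot start another pair
            # with the one that follows.
            last_c = None
        else:
            last_c = c

print(list(adj_non_overlapping('abbbcppq')))   # ['bb', 'pp']
print(list(adj_non_overlapping('abbbbcppq')))  # ['bb', 'bb', 'pp']
```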
xhg
3

"or maybe there's some better ways"

Since regex is often misunderstood by the next developer to encounter your code (who may even be you), and since simpler != shorter,

How about the following pseudo-code:

function findMultipleLetters(inputString) {        
    foreach (letter in inputString) {
        dictionaryOfLettersOccurrance[letter]++;
        if (dictionaryOfLettersOccurrance[letter] == 2) {
            multipleLetters.add(letter);
        }
    }
    return multipleLetters;
}
multipleLetters = findMultipleLetters("ugknbfddgicrmopn");
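One possible Python rendering of this pseudo-code is sketched below. The snake_case names and the use of defaultdict are assumptions of this sketch, not something the pseudo-code specifies.

```python
from collections import defaultdict

def find_multiple_letters(input_string):
    # Count each letter and record it the moment its count reaches 2.
    occurrences = defaultdict(int)
    multiple_letters = []
    for letter in input_string:
        occurrences[letter] += 1
        if occurrences[letter] == 2:
            multiple_letters.append(letter)
    return multiple_letters

multiple_letters = find_multiple_letters("ugknbfddgicrmopn")
print(multiple_letters)  # ['d', 'g', 'n']
```

This finds letters that occur more than once anywhere in the string, so 'g' and 'n' are reported alongside 'd' even though only 'd' appears twice in a row.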
Lavi Avigdor
2
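A1 = "abcdededdssffffccfxx"

for i in range(len(A1) - 1):
    # Report a pair only at the start of a run, so 'ffff' prints 'ff' once.
    # The i == 0 check avoids A1[-1] wrapping around to the last character.
    if A1[i + 1] == A1[i] and (i == 0 or A1[i] != A1[i - 1]):
        print(A1[i] * 2)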
Remi Guan
Mark White
0
>>> l = ['ug', 'kn', 'bf', 'dd', 'gi', 'cr', 'mo', 'pn']
>>> import re
>>> newList = [item for item in l if re.search(r"([a-z]{1})\1", item)]
>>> newList
['dd']
Mayur Koshti
  • What is the use if you give a list of items? this will not work for other strings. – Rohan Amrute Dec 14 '15 at 07:25
  • I have used `re.search` which works only for string. – Mayur Koshti Dec 14 '15 at 07:32
  • Also, it works for other strings. Like if you add item 'zz' in list then it will give both 'dd' and 'zz'. – Mayur Koshti Dec 14 '15 at 07:33
  • What I am saying is that you provided a predefined list. So it will match from the list, and all the list items you have given are of length 2. So your program is not flexible. Given a string, it will not give the required output. I am just saying that the input is in the form of a `String`, not a `list`. – Rohan Amrute Dec 14 '15 at 07:35