Finding a sequence of characters in string

Question

Using Python, I am trying to find any sequence of characters in a string by specifying the length of the sequence.

For example, given the following variable, I want to extract every sequence of identical characters with a length of 5:

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"

The result should be:

11111
11111

How can I do that?

You should use regex to match a repeated expression. This post should help: https://stackoverflow.com/a/1660739/7692562 – foobarbaz Feb 19 '19 at 14:15
@user5173426, can you elaborate? `Counter` by itself doesn't tell you anything about consecutive runs of identical characters. – Kevin Feb 19 '19 at 14:16
This is not a do-your-homework-for-you site, nor is it a tutorial site for people who don't know any programming at all. To get an answer, please show what you have tried yourself so far. – Gnudiff Feb 19 '19 at 14:16
@user5173426 `Counter` is not useful here because the characters have to be adjacent; `itertools.groupby` could be used though. – Chris_Rands Feb 19 '19 at 14:16
@user5173426 I think I misunderstood the OP. I think they mean "identify sequences of `n` identical characters", not "identify identical `n`-long sequences within the string". – jdehesa Feb 19 '19 at 14:28

han solo · Answer 1 · 2019-02-19T14:39:03.030

itertools to the rescue :)

>>> import itertools
>>> val = 5
>>> x
'jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111'
>>> [y[0]*val for y in itertools.groupby(x) if len(list(y[1])) == val]
['11111', '11111']

Edit: the same thing with better variable names

>>> [char*val for char,grouper in itertools.groupby(x) if len(list(grouper)) == val]
['11111', '11111']

Or the more memory-efficient one-liner suggested by @Chris_Rands, which counts each group without building a list:

>>> [k*val for k, g in itertools.groupby(x) if sum(1 for _ in g) == val]
['11111', '11111']
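
Note that the groupby versions above keep only runs whose length is exactly val, so a run of six identical characters is skipped. A minimal sketch that makes this choice explicit (the name runs_of_length and the exact flag are hypothetical, not part of the answer):

```python
from itertools import groupby


def runs_of_length(s, n, exact=True):
    """Return runs of one repeated character found in s.

    exact=True keeps only runs whose length is exactly n;
    exact=False keeps runs of length n or more, returned in full.
    """
    result = []
    for char, group in groupby(s):
        length = sum(1 for _ in group)  # count without building a list
        if length == n or (not exact and length > n):
            result.append(char * length)
    return result


x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
print(runs_of_length(x, 5))  # ['11111', '11111']
```

With exact=False, a longer run such as "aaaaaa" is returned whole instead of being dropped.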

RnD · Answer 2 · 2019-02-19T14:40:08.547


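Or, if you are fine with using regex, this makes the code a lot cleaner. The pattern matches runs of five or more identical characters, and row[0] is the whole run:

import re
[row[0] for row in re.findall(r'((.)\2{4,})', x)]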


This pattern does indeed match the sequence that OP is looking for. But `search` only finds the first instance. Is it possible to find _all_ instances? – Kevin Feb 19 '19 at 14:27
@hansolo, that works for the OP's sample input, but I think that he also wants sequences that don't contain the character "1". For example, `"22222 foo QQQQQ"` should return `["22222", "QQQQQ"]` – Kevin Feb 19 '19 at 14:33
@Kevin Then something like `', '.join(y*5 for y in re.findall(r'(.)\1{4}', x))` – han solo Feb 19 '19 at 14:35
Looking good, now :-) I was hoping there would be a findall-based solution that captures only and exactly the full sequences, so that no list comp would be required. But I don't think you can match the sequence without capturing the first character by itself. – Kevin Feb 19 '19 at 14:45
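
A pattern with a backreference always needs a capturing group, so findall returns the group rather than the full run. One way around that, sketched here with a hypothetical helper exact_runs, is to read match.group(0) from re.finditer. It still uses a list comprehension, and it keeps only runs of exactly n characters, which is an assumption about what the OP wants:

```python
import re


def exact_runs(s, n):
    """Return every run of exactly n identical characters in s."""
    # (.)\1* matches one character plus all of its immediate repeats,
    # so each match is a maximal run and group(0) is the whole run.
    return [m.group(0)
            for m in re.finditer(r'(.)\1*', s, flags=re.DOTALL)
            if len(m.group(0)) == n]


print(exact_runs("22222 foo QQQQQ", 5))  # ['22222', 'QQQQQ']
```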

jdehesa · Answer 3 · 2019-02-19T14:44:31.117

The original answer (below) is for a different problem (identifying repeated patterns of n characters in the string). Here is one possible one-liner that solves the actual problem:

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
n = 5
res = [x[i:i + n] for i, c in enumerate(x) if x[i:i + n] == c * n]
print(res)
# ['11111', '11111']
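
One caveat if the input can contain runs longer than n: the slice test is true at every start position inside such a run, so a run of six characters yields two overlapping results. A sketch that skips past each hit instead (non_overlapping_runs is a hypothetical name, and treating a long run as non-overlapping chunks is an assumption about the desired behaviour):

```python
def non_overlapping_runs(s, n):
    """Collect n-long runs of one character, never reusing a position."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = []
    i = 0
    while i + n <= len(s):
        if s[i:i + n] == s[i] * n:
            result.append(s[i:i + n])
            i += n  # jump past this run so windows cannot overlap
        else:
            i += 1
    return result


x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
print(non_overlapping_runs(x, 5))          # ['11111', '11111']
print(non_overlapping_runs("aaaaaab", 5))  # ['aaaaa']
```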

Original (wrong) answer

Using Counter:

from collections import Counter

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
n = 5
c = Counter(x[i:i + n] for i in range(len(x) - n + 1))
for k, v in c.items():
    if v > 1:
        print(*([k] * v), sep='\n')

Output:

**111
**111
*1111
*1111
11111
11111
1111*
1111*
111**
111**

Although it's for a different problem, I liked this. +1 – DirtyBit Feb 19 '19 at 14:39

Xenobiologist · Answer 4 · 2019-02-19T14:59:51.893


Very ugly solution :-)

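x = "jhg**11111**jjhgj**11111**klhhkjh22222jhjkh1111"
# enumerate yields (index, character): c is the index, i is the character
for c, i in enumerate(x):
    # compare the character with each of the next four positions
    if i == x[c+1:c+2] and i == x[c+2:c+3] and i == x[c+3:c+4] and i == x[c+4:c+5]:
        print(x[c:c+5])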
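
The same check can be written for any length instead of a hard-coded 5. A rough sketch (find_runs is a hypothetical name; like the loop above, it reports overlapping windows when a run is longer than n):

```python
def find_runs(s, n):
    """Return each n-long window of s made of a single repeated character."""
    found = []
    for start in range(len(s) - n + 1):
        window = s[start:start + n]
        # every character in the window must equal the first one
        if all(ch == window[0] for ch in window):
            found.append(window)
    return found


x = "jhg**11111**jjhgj**11111**klhhkjh22222jhjkh1111"
print(find_runs(x, 5))  # ['11111', '11111', '22222']
```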


Style tip: consider using `for c, i in enumerate(x):` instead of manually incrementing a count variable. – Kevin Feb 19 '19 at 14:54
Thanks. I edited my code. Still ugly, but should work :-) – Xenobiologist Feb 19 '19 at 15:00

score 0 · Answer 5 · answered Feb 19 '19 at 14:19

Try this:

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"

seq_length = 5

for item in set(x):
    if seq_length*item in x:
        for i in range(x.count(seq_length*item)):
            print(seq_length*item)

It works by using set() to get each distinct character in the text, building the run seq_length*item for that character, and printing the run once for every time x.count() finds it.

It prints your desired output:

11111
11111
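
Because set() has no defined order, the loop above can print the runs for different characters in a different order from one execution to the next. A sketch that keeps first-appearance order by using dict.fromkeys (insertion-ordered in Python 3.7+) and returns a list instead of printing; repeated_runs is a hypothetical name:

```python
def repeated_runs(s, n):
    """List n-long runs, grouped by character in order of first appearance."""
    result = []
    for ch in dict.fromkeys(s):  # distinct characters, order preserved
        run = ch * n
        # str.count counts non-overlapping occurrences of run in s
        result.extend([run] * s.count(run))
    return result


print(repeated_runs("QQQQQ foo 22222 QQQQQ", 5))
# ['QQQQQ', 'QQQQQ', '22222']
```

The results are grouped per character rather than listed by position in the string, which matches the behaviour of the original loop.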

score 0 · Answer 6 · answered Feb 19 '19 at 14:53

Let's change your source string a little:

x = "jhg**11111**jjhgj**22222**klhhkjh33333jhjkh44444"

The regex should be:

import re

pat = r'(.)\1{4}'

Here you have a capturing group (a single character) followed by a backreference to it repeated 4 times, so in total the same character must occur 5 times in a row.

One way to print the result, although less intuitive, is:

res = re.findall(pat, x)
print(res)

But the above code prints:

['1', '2', '3', '4']

That is, a list in which each item is only the capturing group (in our case the first character), not the whole match.

So I also propose a second variant, which uses finditer and prints both the start position and the whole match:

for match in re.finditer(pat, x):
    print('{:2d}: {}'.format(match.start(), match.group()))

For the above data the result is:

 5: 11111
19: 22222
33: 33333
43: 44444
