0

Using Python, I am trying to find every run of identical characters in a string, given the length of the run.

For example, given the following variable, I want to extract every run of identical characters with a length of 5:

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"

The result should be:

11111
11111

How can I do that?

ALBADI
  • `Counter` could be your friend. – DirtyBit Feb 19 '19 at 14:14
  • You should use regex to match a repeated expression. This post should help: https://stackoverflow.com/a/1660739/7692562 – foobarbaz Feb 19 '19 at 14:15
  • @user5173426, can you elaborate? `Counter` by itself doesn't tell you anything about consecutive runs of identical characters. – Kevin Feb 19 '19 at 14:16
  • This is not a do-your-homework-for-you site, nor is it a tutorial site for people who don't know any programming at all. To get an answer, please show what you have tried yourself so far. – Gnudiff Feb 19 '19 at 14:16
  • @user5173426 `Counter` is not useful here because the characters have to be adjacent; `itertools.groupby` could be used though – Chris_Rands Feb 19 '19 at 14:16
  • @jdehesa Why did you delete your answer? – DirtyBit Feb 19 '19 at 14:27
  • @user5173426 I think I misunderstood the OP. I think they mean "identify sequences of `n` identical characters", not "identify identical `n`-long sequences within the string". – jdehesa Feb 19 '19 at 14:28
  • Better get my coffee. – DirtyBit Feb 19 '19 at 14:29

6 Answers

3

itertools to the rescue :)

>>> import itertools
>>> val = 5
>>> x
'jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111'
>>> [y[0]*val for y in itertools.groupby(x) if len(list(y[1])) == val]
['11111', '11111']

Edit: with better names:

>>> [char*val for char,grouper in itertools.groupby(x) if len(list(grouper)) == val]
['11111', '11111']

Or the more memory-efficient one-liner suggested by @Chris_Rands:

>>> [k*val for k, g in itertools.groupby(x) if sum(1 for _ in g) == val]
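
The same `groupby` idea can be wrapped in a small function. This is only a sketch: the name `runs_of_length` and its `exact` flag are assumptions made for illustration, not part of `itertools`. With `exact=True` it keeps only runs whose length is exactly `n`, as in the snippets above; with `exact=False` it also reports the first `n` characters of a longer run.

```python
from itertools import groupby


def runs_of_length(s, n, exact=True):
    """Return one n-character string per qualifying run of identical characters.

    Hypothetical helper: with exact=True only runs of exactly n characters
    qualify; with exact=False any run of at least n characters qualifies once.
    """
    result = []
    for char, group in groupby(s):
        length = sum(1 for _ in group)  # count the run without building a list
        if length == n or (not exact and length > n):
            result.append(char * n)
    return result


x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
print(runs_of_length(x, 5))                 # ['11111', '11111']
print(runs_of_length("aaaaaab", 5))         # []
print(runs_of_length("aaaaaab", 5, False))  # ['aaaaa']
```

Which behaviour is wanted for runs longer than `n` is not stated in the question, so the flag simply leaves that choice to the caller.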
han solo
2

Or, if you are fine with using regex, it makes the code a lot cleaner:

import re  # x is the string from the question
[row[0] for row in re.findall(r'((.)\2{4,})', x)]  # {4,} also matches runs longer than 5

regex101 - example

RnD
  • This pattern does indeed match the sequence that OP is looking for. But `search` only finds the first instance. Is it possible to find _all_ instances? – Kevin Feb 19 '19 at 14:27
  • @hansolo, that works for the OP's sample input, but I think that he also wants sequences that don't contain the character "1". For example, `"22222 foo QQQQQ"` should return `["22222", "QQQQQ"]` – Kevin Feb 19 '19 at 14:33
  • @Kevin Then something like `', '.join(y*5 for y in re.findall(r'(.)\1{4}', x))` – han solo Feb 19 '19 at 14:35
  • Looking good, now :-) I was hoping there would be a findall-based solution that captures only and exactly the full sequences, so that no list comp would be required. But I don't think you can match the sequence without capturing the first character by itself. – Kevin Feb 19 '19 at 14:45
1

The original answer (below) solves a different problem (identifying repeated patterns of n characters in the string). Here is one possible one-liner for the actual problem:

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
n = 5
res = [x[i:i + n] for i, c in enumerate(x) if x[i:i + n] == c * n]
print(res)
# ['11111', '11111']
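
One thing to be aware of with the slicing approach is that it checks every start position, so a run longer than `n` produces overlapping results. The sketch below illustrates this; the function names `sliding_runs` and `non_overlapping_runs` are hypothetical, and the second is just one possible way to skip past a match.

```python
def sliding_runs(s, n):
    """Every n-character window of s made of a single repeated character."""
    return [s[i:i + n] for i, c in enumerate(s) if s[i:i + n] == c * n]


def non_overlapping_runs(s, n):
    """Like sliding_runs, but jump past each match so windows never overlap."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = []
    i = 0
    while i + n <= len(s):
        if s[i:i + n] == s[i] * n:
            result.append(s[i:i + n])
            i += n  # skip the characters that were just matched
        else:
            i += 1
    return result


print(sliding_runs("aaaaaab", 5))          # ['aaaaa', 'aaaaa']
print(non_overlapping_runs("aaaaaab", 5))  # ['aaaaa']
```

For the sample string in the question both functions give the same result, because none of its runs is longer than 5.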

Original (wrong) answer

Using Counter:

from collections import Counter

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
n = 5
c = Counter(x[i:i + n] for i in range(len(x) - n + 1))
for k, v in c.items():
    if v > 1:
        print(*([k] * v), sep='\n')

Output:

**111
**111
*1111
*1111
11111
11111
1111*
1111*
111**
111**
jdehesa
1

Very ugly solution :-)

x = "jhg**11111**jjhgj**11111**klhhkjh22222jhjkh1111"
for i, c in enumerate(x):
    if c == x[i+1:i+2] and c == x[i+2:i+3] and c == x[i+3:i+4] and c == x[i+4:i+5]:
        print(x[i:i+5])
Xenobiologist
0

Try this:

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"

seq_length = 5

for item in set(x):
    if seq_length*item in x:
        for i in range(x.count(seq_length*item)):
            print(seq_length*item)

It works by using set() to get each distinct character, building a run of seq_length copies of that character, and then searching for that run in the text.

This prints the desired output:

11111
11111
vencaslac
0

Let's change your source string a little:

x = "jhg**11111**jjhgj**22222**klhhkjh33333jhjkh44444"

The regex should be:

pat = r'(.)\1{4}'

Here you have a capturing group (a single character) and a backreference to it (repeated 4 times), so in total the same character must occur 5 times in a row.

One way to print the result, although less intuitive, is:

import re
res = re.findall(pat, x)
print(res)

But the above code prints:

['1', '2', '3', '4']

i.e. a list where each element is only the captured group (in our case the first character), not the whole match.

So I also propose a second variant, which uses finditer and prints both the start position and the whole match:

for match in re.finditer(pat, x):
    print('{:2d}: {}'.format(match.start(), match.group()))

For the above data the result is:

 5: 11111
19: 22222
33: 33333
43: 44444
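
If the run length should be a parameter rather than hard-coded, the pattern can be built from it. A minimal sketch, assuming a helper named `find_runs` (hypothetical, not part of `re`); like the pattern above, it also matches the first `n` characters of a longer run.

```python
import re


def find_runs(s, n):
    """Return (start, text) pairs for runs of n identical characters in s.

    Hypothetical helper. The group captures one character and the
    backreference requires n - 1 more copies, so each match has length n.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # "." does not match a newline unless re.DOTALL is passed.
    pattern = re.compile(r"(.)\1{%d}" % (n - 1))
    return [(m.start(), m.group()) for m in pattern.finditer(s)]


x = "jhg**11111**jjhgj**22222**klhhkjh33333jhjkh44444"
for start, text in find_runs(x, 5):
    print("{:2d}: {}".format(start, text))
```

Compiling the pattern once inside the function keeps the call site free of regex syntax, at the cost of rebuilding it on each call.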
Valdi_Bo