How to code a regex pattern for multiple identical characters in python3?

Question

I have a long string of the following form:

joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI..."

It is a concatenation of random strings separated by runs of consecutive F letters:

ASOGH
FFFFFFFFFFFFFFFFFFF
GFIOSG
FFFFFFFF
URHDHREEK
FFFFFF
IIIEI

The number of consecutive F letters in a separator is not fixed, but there will be more than 5, and let's assume that five consecutive F letters never appear inside a random string.

I want to extract only random strings to get the following list:

random_strings = ['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']

I imagine there is a simple regex that would solve this task:

random_strings = joined_string.split('WHAT_TO_TYPE_HERE?')

Question: how do I write a regex pattern that matches multiple identical consecutive characters?

Does this answer your question? [Split string based on regex](https://stackoverflow.com/questions/13209288/split-string-based-on-regex) — Jan Wilamowski, Jul 08 '21 at 08:17
`str.split` cannot take a regex so use the `re` module with the pattern `F+` — Jan Wilamowski, Jul 08 '21 at 08:18

score 1 · Answer 1 · answered Jul 08 '21 at 08:22

You can split on F{5,} with re.split and wrap the pattern in a capture group so that the matched separators are also included in the result:

import re
s = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
print(re.split(r'(F{5,})', s))

Output:

['ASOGH', 'FFFFFFFFFFFFFFFFFFF', 'GFIOSG', 'FFFFFFFF', 'URHDHREEK', 'FFFFFF', 'IIIEI']
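
Because the pattern contains exactly one capture group, the list returned by re.split alternates between text and separator, so slicing can pull the two apart. A minimal sketch, assuming only the random strings or only the F runs are wanted (the names random_strings and f_runs are illustrative, not from the answer):

```python
import re

s = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.split(r'(F{5,})', s)

# Even indices hold the text between separators,
# odd indices hold the captured runs of F.
random_strings = parts[::2]
f_runs = parts[1::2]

print(random_strings)  # ['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']
print(f_runs)
```

If the separators are never needed, dropping the capture group avoids the slicing step altogether.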

score 1 · Accepted Answer · answered Jul 08 '21 at 08:23

I would use re.split for this task in the following way:

import re
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.split(r'F{5,}', joined_string)
print(parts)

Output:

['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']

F{5,} matches a run of 5 or more consecutive F characters.
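
If the input can start or end with a run of F letters, re.split returns empty strings at those positions. A small sketch of filtering them out, using a made-up input string for illustration:

```python
import re

# Hypothetical input that begins and ends with a separator run.
padded = "F" * 6 + "ABC" + "F" * 5 + "XYZ" + "F" * 7

parts = re.split(r'F{5,}', padded)
print(parts)  # ['', 'ABC', 'XYZ', '']

# Drop the empty strings produced by leading and trailing separators.
random_strings = [p for p in parts if p]
print(random_strings)  # ['ABC', 'XYZ']
```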

score 0 · Answer 3 · answered Jul 08 '21 at 08:16

I would use a re.findall approach here:

import re
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.findall(r'F{2,}|(?:[A-EG-Z]|F(?!F))+', joined_string)
print(parts)

This prints:

['ASOGH', 'FFFFFFFFFFFFFFFFFFF', 'GFIOSG', 'FFFFFFFF', 'URHDHREEK', 'FFFFFF', 'IIIEI']

The regex pattern here can be explained as:

F{2,}         match any group of 2 or more consecutive F's (first)
|             OR, that failing
(?:
    [A-EG-Z]  match any uppercase letter other than F
    |         OR
    F(?!F)    match a single F (not followed by an F)
)+            all of these, one or more times
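
Note that the pattern above treats any run of two or more F's as a separator, whereas the question only guarantees separator runs of five or more, and it also returns the F runs themselves. One possible adaptation, offered as a sketch rather than as part of the original answer, raises the threshold to five and then drops the separator runs. The helper name extract_random_strings and the second test string are made up for illustration, and [^F] is a swapped-in choice that accepts any character other than F, not only uppercase letters:

```python
import re

# F{5,}       a separator run of 5 or more F's
# [^F]        any character other than F
# F(?!F{4})   an F that is not followed by four more F's
PATTERN = re.compile(r'F{5,}|(?:[^F]|F(?!F{4}))+')


def extract_random_strings(text):
    """Return the chunks of text that are not separator runs of F's."""
    return [p for p in PATTERN.findall(text)
            if not re.fullmatch(r'F{5,}', p)]


joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
print(extract_random_strings(joined_string))
# ['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']

# A short run of F's inside a random string is kept intact.
print(extract_random_strings("ABFFCD" + "F" * 6 + "EF"))
# ['ABFFCD', 'EF']
```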
