
I have a long string of the following form:

joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI..."

It is a concatenation of random strings separated by runs of consecutive F letters:

ASOGH
FFFFFFFFFFFFFFFFFFF
GFIOSG
FFFFFFFF
URHDHREEK
FFFFFF
IIIEI

The number of consecutive F letters is not fixed, but there will always be more than 5, and let's assume that five F letters never appear consecutively inside a random string.

I want to extract only the random strings to get the following list:

random_strings = ['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']

I imagine there is a simple regular expression that would solve this task:

random_strings = joined_string.split('WHAT_TO_TYPE_HERE?')

Question: how do I write a regex pattern that matches a run of multiple identical characters?

mercury0114

3 Answers


You can split on F{5,} with re.split. Wrapping the pattern in a capture group keeps the matched F runs in the result as well; drop the parentheses if you only want the random strings:

import re
s = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
print( re.split(r'(F{5,})', s) )

Output:

['ASOGH', 'FFFFFFFFFFFFFFFFFFF', 'GFIOSG', 'FFFFFFFF', 'URHDHREEK', 'FFFFFF', 'IIIEI']
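If only the random strings are needed while still keeping the F runs available, one option is to slice the result. This is a minimal sketch that relies on re.split returning text and delimiters alternately when the pattern contains exactly one capture group; the names random_strings and f_runs are only illustrative.

```python
import re

s = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"

# With exactly one capture group, re.split returns
# [text, delimiter, text, delimiter, ..., text]
parts = re.split(r'(F{5,})', s)

random_strings = parts[::2]   # even indices: the text between F runs
f_runs = parts[1::2]          # odd indices: the F runs themselves

print(random_strings)  # ['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']
print(f_runs)
```

Joining parts back together with ''.join(parts) reproduces the original string, which can be handy as a sanity check.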

anubhava

I would use re.split for this task in the following way:

import re
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.split(r'F{5,}', joined_string)
print(parts)

Output:

['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']

F{5,} matches a run of 5 or more consecutive F characters.
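One edge case worth knowing about: if the input starts or ends with a run of F's, re.split returns empty strings at those positions. The sketch below uses a made-up input (not from the question) and filters the empty entries out.

```python
import re

# Hypothetical input that starts and ends with a run of F's
joined_string = "F" * 6 + "ABC" + "F" * 7 + "XYZ" + "F" * 5

parts = re.split(r'F{5,}', joined_string)
print(parts)  # ['', 'ABC', 'XYZ', '']

# Drop the empty strings produced by the leading and trailing runs
non_empty = [p for p in parts if p]
print(non_empty)  # ['ABC', 'XYZ']
```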

Daweo

I would use a re.findall approach here:

import re
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.findall(r'F{2,}|(?:[A-EG-Z]|F(?!F))+', joined_string)
print(parts)

This prints:

['ASOGH', 'FFFFFFFFFFFFFFFFFFF', 'GFIOSG', 'FFFFFFFF', 'URHDHREEK', 'FFFFFF', 'IIIEI']

The regex pattern here can be explained as:

F{2,}         match any group of 2 or more consecutive F's (first)
|             OR, that failing
(?:
    [A-EG-Z]  match any non F character
    |         OR
    F(?!F)    match a single F (not followed by an F)
)+            all of these, one or more times
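Note that the pattern above treats any run of 2 or more F's as a separator, whereas the question only guarantees that five consecutive F's never occur inside a random string. A possible variant that honors the five-F threshold is sketched below; the lookahead-based pattern and the sample input are my own assumptions, not part of the original answer.

```python
import re

# Hypothetical input: the first random string contains "FF",
# which is shorter than the 5-F separator threshold
joined_string = "ABFFCD" + "F" * 6 + "GFH"

# Pattern from the answer: splits on any run of 2+ F's
original = re.findall(r'F{2,}|(?:[A-EG-Z]|F(?!F))+', joined_string)
print(original)  # ['AB', 'FF', 'CD', 'FFFFFF', 'GFH']

# Variant: a separator is 5+ F's; otherwise consume uppercase letters
# one at a time as long as a run of 5 F's does not start at that position
variant = re.findall(r'F{5,}|(?:(?!F{5})[A-Z])+', joined_string)
print(variant)  # ['ABFFCD', 'FFFFFF', 'GFH']
```

A random string that ends in F directly before a separator remains ambiguous under either pattern, since the trailing F is indistinguishable from the run that follows it.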
Tim Biegeleisen