2

Given a string: s = FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE

The delimiting characters are P, Q, D and E.

I want to be able to split the string on these characters.

Based on: Is it possible to split a string on multiple delimiters in order?

I have the following:

def splits(s,seps):
    l,_,r = s.partition(seps[0])
    if len(seps) == 1:
        return [l,r]
    return [l] + splits(r,seps[1:])

seps = ['P', 'D', 'Q', 'E']

sequences = splits(s, seps)

This gives me:

['FFFFRRFFFFFFF',
 'PRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLF',
 'RRFRRFFFFFFFFR',
 '',
 'E']

As we can see, the second entry contains many P characters.

What I want is the run of characters after the last occurrence of P, not after the first occurrence (i.e., RFFFFFFFLF).

Also, the order in which the delimiting characters occur is not fixed.

I am looking for solutions or hints on how to achieve this.

Update: The desired output is the set of all strings between these delimiters (similar to the output shown), but subject to the last-occurrence condition described above.

Update2: Expected output

['FFFFRRFFFFFFF',
 'RFFFFFFFLF',   # << this is where the output differs
 'RRFRRFFFFFFFFR',
 '',
 '']   # << the string ends with two consecutive E characters and nothing between them, so this entry should be empty
eyllanesc
okkhoy
  • What is your expected output? – Jab Jun 16 '19 at 09:29
  • Your desired output is not clear, but if you want to split on the last occurrence, try to replace ``partition`` with ``rpartition``. – AlCorreia Jun 16 '19 at 09:38
  • Sorry, the output should be all set of strings that are present between the delimiters; similar to the one given, but adhering to the condition on the last occurrence. (Updated the question) – okkhoy Jun 16 '19 at 09:43
  • Instead of describing what the output should look like, can you literally show the expected output? – Sweeper Jun 16 '19 at 09:50
  • Is it [like this](https://regex101.com/r/DWhrQM/1/) what you need? – bobble bubble Jun 16 '19 at 09:58
  • @bobblebubble No. I updated the question with the expected output – okkhoy Jun 16 '19 at 10:00

4 Answers

2

It sounds like you want to split on the span that runs from the first appearance of a delimiter character through its last appearance.

([PDQE])(?:.*\1)?

Try the split pattern at regex101, or see the PHP demo at 3v4l.org (it should behave similarly in Python).
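For reference, here is a minimal Python sketch of this pattern, assuming the sample string from the question. Because the pattern contains a capturing group, `re.split` also returns the captured delimiter between the pieces, so the sketch keeps only every second element. The names `parts` and `sequences` are illustrative, not part of the original answer.

```python
import re

s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"

# Match a delimiter and, if it appears again later, everything up to its last occurrence.
pattern = r'([PDQE])(?:.*\1)?'

# re.split interleaves the captured delimiter with the pieces:
# [piece, delimiter, piece, delimiter, ..., piece]
parts = re.split(pattern, s)

# Keep only the pieces (even indices) and drop the captured delimiters.
sequences = parts[::2]
print(sequences)
```

On the sample string this should give the five entries listed as the expected output in the question. Note that `.` does not match newlines by default, so multi-line input would need the `re.DOTALL` flag.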

bobble bubble
1
import re

s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"

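def get_sequences(s):
    # Map each delimiter to (text preceding its first occurrence, order of first appearance).
    seen_delimiters = {c: ('', None) for c in 'PDQE'}
    order = 0
    for g in re.finditer(r'(.*?)([PDQE]|\Z)', s):
        # g[2] is empty for the final end-of-string match, so skip it.
        if g[2] and seen_delimiters[g[2]][1] is None:
            seen_delimiters[g[2]] = (g[1], order)
            order += 1
    return seen_delimiters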

for k, (seq, order) in get_sequences(s).items():
    print('{}: order: {} seq: {}'.format(k, order, seq))

Prints:

P: order: 0 seq: FFFFRRFFFFFFF
D: order: 1 seq: RFFFFFFFLF
Q: order: 2 seq: RRFRRFFFFFFFFR
E: order: 3 seq: 
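
If a plain list in order of first appearance is preferred over a dictionary, the same idea can be sketched as follows. The helper name `first_sequences` and its `delimiters` parameter are hypothetical, and the sketch relies on dictionaries preserving insertion order (Python 3.7+). Unlike the expected output in the question, it does not emit a trailing empty entry for the second E.

```python
import re

s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"

def first_sequences(s, delimiters='PDQE'):
    """Return the text preceding the first occurrence of each delimiter,
    in the order the delimiters first appear (hypothetical helper)."""
    seen = {}
    pattern = r'(.*?)([{}])'.format(re.escape(delimiters))
    for m in re.finditer(pattern, s):
        # setdefault keeps only the first sequence recorded for each delimiter.
        seen.setdefault(m.group(2), m.group(1))
    return list(seen.values())

result = first_sequences(s)
print(result)
```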

Update (to print each sequence together with the delimiters that follow it):

import re
s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"
for g in re.finditer(r'(.*?)([PDQE]+|\Z)', s):
    print(g[1], g[2])

Prints:

FFFFRRFFFFFFF PP
RRRRRRLLRLLRLLL PP
F PP
L PP
L PP
LF PP
FF P
FLR P
FFRRLLR P
F P
RFFFFFFFLF D
RRFRRFFFFFFFFR QEE
Andrej Kesely
  • This seems to be doing what I want. Let me test it with other inputs. I think there was a brief post you made earlier that captured the sequence of characters and the delimiter enclosing it. Can you add that as well to your answer, if possible? Thanks! – okkhoy Jun 16 '19 at 10:28
0

Use re.split with a character class [PQDE]:

import re

s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'    
sequences = re.split(r'[PQDE]', s)
print(sequences)

Output:

['FFFFRRFFFFFFF', '', 'RRRRRRLLRLLRLLL', '', 'F', '', 'L', '', 'L', '', 'LF', '', 'FF', 'FLR', 'FFRRLLR', 'F', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '', '', '']

If you want to split on one or more consecutive delimiters:

import re

s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'    
sequences = re.split(r'[PQDE]+', s)
print(sequences)

Output:

['FFFFRRFFFFFFF', 'RRRRRRLLRLLRLLL', 'F', 'L', 'L', 'LF', 'FF', 'FLR', 'FFRRLLR', 'F', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '']

If you want to capture the delimiters:

import re

s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'    
sequences = re.split(r'([PQDE])', s)
print(sequences)

Output:

['FFFFRRFFFFFFF', 'P', '', 'P', 'RRRRRRLLRLLRLLL', 'P', '', 'P', 'F', 'P', '', 'P', 'L', 'P', '', 'P', 'L', 'P', '', 'P', 'LF', 'P', '', 'P', 'FF', 'P', 'FLR', 'P', 'FFRRLLR', 'P', 'F', 'P', 'RFFFFFFFLF', 'D', 'RRFRRFFFFFFFFR', 'Q', '', 'E', '', 'E', '']
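
When the delimiters are captured, the result alternates between pieces and delimiters, so each piece can be paired with the delimiter that follows it. The sketch below assumes the same sample string; the names `parts` and `pairs` are illustrative. The final piece has no following delimiter, so `zip` drops it.

```python
import re

s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'

# Even indices hold the pieces, odd indices hold the captured delimiters.
parts = re.split(r'([PQDE])', s)

# Pair each piece with the delimiter that terminates it.
pairs = list(zip(parts[::2], parts[1::2]))

for piece, delimiter in pairs:
    print(repr(piece), delimiter)
```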
Toto
0

This solution iterates over the delimiters one by one, so you can control the order in which each of them is applied:

s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'
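splitters = 'PDQE'
for sp in splitters:
    if isinstance(s, str):
        # first pass: s is still the original string
        s = s.split(sp)
    else:
        # later passes: s is a list of pieces, so split each piece
        s = [x.split(sp) for x in s]
        # flatten the nested lists, dropping empty strings
        s = [item for sublist in s for item in sublist if item != '']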

Output:

['FFFFRRFFFFFFF',
 'RRRRRRLLRLLRLLL',
 'F',
 'L',
 'L',
 'LF',
 'FF',
 'FLR',
 'FFRRLLR',
 'F',
 'RFFFFFFFLF',
 'RRFRRFFFFFFFFR']
Binyamin Even