12

Objective: I am trying to perform a cut in Python RegEx where split doesn't quite do what I want. I need to cut within a pattern, but between characters.

What I am looking for:

I need to recognize the pattern below in a string and split the string at the location of the pipe. The pipe isn't actually in the string; it just shows where I want to split.

Pattern: CDE|FG

String: ABCDEFGHIJKLMNOCDEFGZYPE

Results: ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']

What I have tried:

It seems like using split with parentheses is close, but it doesn't keep the search pattern attached to the results like I need it to.

re.split('CDE()FG', 'ABCDEFGHIJKLMNOCDEFGZYPE')

Gives,

['AB', 'HIJKLMNO', 'ZYPE']

When I actually need,

['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']

Motivation:

Practicing with RegEx, and wanted to see if I could use RegEx to make a script that would predict the fragments of a protein digestion using specific proteases.

Bhargav Rao
Michael Molter

4 Answers

8

A non regex way would be to replace the pattern with the piped value and then split.

>>> pattern = 'CDE|FG'
>>> s = 'ABCDEFGHIJKLMNOCDEFGZYPE'
>>> s.replace('CDEFG',pattern).split('|')
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
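
The same replace-then-split idea can be extended to several cut patterns at once, which may be useful for the protease use case in the question. The following is only a sketch: the function name `split_on_patterns`, the second pattern `MN|O`, and the choice of `"\x00"` as marker are assumptions made for illustration, and the marker is checked against the input rather than assumed to be absent.

```python
def split_on_patterns(s, patterns, marker="\x00"):
    """Split s at the pipe position of every pattern, e.g. 'CDE|FG'.

    The marker is a hypothetical sentinel; any string that cannot
    occur in the data would do.
    """
    if marker in s:
        raise ValueError("marker already occurs in the input")
    for pattern in patterns:
        # Each pattern is expected to contain exactly one pipe.
        left, right = pattern.split("|")
        # Re-insert the matched text with the marker at the cut point.
        s = s.replace(left + right, left + marker + right)
    return s.split(marker)


s = 'ABCDEFGHIJKLMNOCDEFGZYPE'
print(split_on_patterns(s, ['CDE|FG']))
# ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
print(split_on_patterns(s, ['CDE|FG', 'MN|O']))
# ['ABCDE', 'FGHIJKLMN', 'OCDE', 'FGZYPE']
```

One limitation of this sketch: once a marker has been inserted, a later pattern whose text spans that cut point will no longer match, so the order of the patterns can matter.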
Bhargav Rao
  • Unlike the regex, this allows you to split the string on many different patterns easily. However, it will produce an unwanted result if the control character you introduce (in this case, the pipe) is already in use in the file. – A-y Jun 20 '16 at 18:10
  • @Yab Exactly. The answer is a *faster* alternative to the regex. The OP mentions in a comment that they are open to non-regex answers, hence the answer. The *pipe* here is assumed not to be present in the dataset. Usually in such cases a multi-character delimiter containing special characters and unicode literals is used. I haven't used one here, to keep the demonstration simple. – Bhargav Rao Jun 20 '16 at 18:16
  • For cases where you _need_ a regexp, you can use the same approach with `re.sub`. For example, `re.sub(r"(CD[xy])(FG)", r"\1|\2", data)`. – alexis Jun 22 '16 at 14:04
  • Yep, for this case in particular we can use `re.sub(r"(CDE)(FG)", r"\1|\2", s).split('|')`. – Bhargav Rao Jun 22 '16 at 14:07
5

You can solve it with re.split() and positive "look arounds":

>>> re.split(r"(?<=CDE)(\w+)(?=FG)", s)
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']

Note that if one of the cut sequences is an empty string, you would get an empty string inside the resulting list. You can handle that "manually"; here is a sample (I admit, it is not that pretty):

import re

s = "ABCDEFGHIJKLMNOCDEFGZYPE"

cut_sequences = [
    ["CDE", "FG"],
    ["FGHI", ""],
    ["", "FGHI"]
]

for left, right in cut_sequences:
    items = re.split(r"(?<={left})(\w+)(?={right})".format(left=left, right=right), s)

    if not left:
        items = items[1:]

    if not right:
        items = items[:-1]

    print(items)

Prints:

['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
['ABCDEFGHI', 'JKLMNOCDEFGZYPE']
['ABCDE', 'FGHIJKLMNOCDEFGZYPE']
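
If Python 3.7 or later is available, `re.split()` also splits on empty matches, so a purely zero-width pattern (a lookbehind directly followed by a lookahead, with nothing captured in between) can mark the cut point itself. The sketch below assumes that Python version; the helper name `cut` and the pipe-in-pattern convention are illustrative choices, not part of the answer above. Empty sides are simply left out of the regex, so no slicing of the result is needed.

```python
import re


def cut(s, pattern):
    """Split s at the position marked by '|' in pattern (Python 3.7+)."""
    left, right = pattern.split("|")
    if not (left or right):
        raise ValueError("pattern needs text on at least one side of the pipe")
    regex = ""
    if left:
        # The text just before the cut must be `left`.
        regex += "(?<={})".format(re.escape(left))
    if right:
        # The text just after the cut must be `right`.
        regex += "(?={})".format(re.escape(right))
    # The regex consumes no characters, so nothing is dropped by the split.
    return re.split(regex, s)


s = "ABCDEFGHIJKLMNOCDEFGZYPE"
print(cut(s, "CDE|FG"))   # ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
print(cut(s, "|FGHI"))    # ['ABCDE', 'FGHIJKLMNOCDEFGZYPE']
print(cut(s, "FGHI|"))    # ['ABCDEFGHI', 'JKLMNOCDEFGZYPE']
```

Because the two halves must be adjacent, a string such as `"ABCDExxxxxxxFGH"` is left in one piece, and any number of cut points is handled.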
alecxe
  • I like this solution a lot, and it does what I asked, but when trying to generalize, I can't get a cut sequence like this to work `|FGHI`. – Michael Molter Jun 20 '16 at 18:04
  • @MichaelMolter yeah, you'll get the extra empty string as the first split item, right? I'm afraid you would have to handle the empty split delimiter cases like `|FGHI` or `FGHI|` by manually slicing the result of `re.split()`: `[1:]` and `[:-1]` respectively. Maybe there is a more elegant way to handle that. Thanks. – alecxe Jun 20 '16 at 18:10
  • This solution is so wrong: It will happily split `"ABCDExxxxxxxFGH"` into three pieces, and it will *not* split correctly if there are three correct cut points, etc. – alexis Jun 20 '16 at 18:27
  • @alexis well, 3 cut points... of course, that was not in the OP's question. I can break this solution easily in so many ways and I am not pretending to provide a universal solution here, I am just happy to be helpful. Thanks. – alecxe Jun 20 '16 at 18:30
  • @alecxe, aren't you trivializing the problem a bit? You focus on the OP's trivial example and ignore the description, which makes it clear that CDE-FG are adjacent and that there's an indeterminate number of splits. I expected you would fix or delete the answer, not just shrug. – alexis Jun 20 '16 at 20:16
  • @alexis okay, I see your point and I'm ready to improve the answer. Could you provide me a sample string that would fit the OP's requirements and where the provided solution would not work? Thank you. – alecxe Jun 21 '16 at 00:01
2

To keep the splitting pattern, or parts of it, when you split with re.split, enclose them in parentheses.

>>> data
'ABCDEFGHIJKLMNOCDEFGZYPE'
>>> pieces = re.split(r"(CDE)(FG)", data)
>>> pieces
['AB', 'CDE', 'FG', 'HIJKLMNO', 'CDE', 'FG', 'ZYPE']

Easy enough. All the parts are there, but as you can see they have been separated. So we need to reassemble them. That's the trickier part. Look carefully and you'll see you need to join the first two pieces, the last two pieces, and the rest in triples. I simplify the code by padding the list, but you could do it with the original list (and a bit of extra code) if performance is a problem.

>>> pieces = [""] + pieces
>>> [ "".join(pieces[i:i+3]) for i in range(0,len(pieces), 3) ]
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']

re.split() guarantees a piece for every capturing (parenthesized) group, plus a piece for what's between. With more complex regular expressions that need their own grouping, use non-capturing groups to keep the format of the returned data the same. (Otherwise you'll need to adapt the reassembly step.)

PS. I also like Bhargav Rao's suggestion to insert a separator character in the string. If performance is not an issue, I guess it's a matter of taste.

Edit: Here's a (less transparent) way to do it without adding an empty string to the list:

pieces = re.split(r"(CDE)(FG)", data)
result = [ "".join(pieces[max(i-3,0):i]) for i in range(2,len(pieces)+2, 3) ]
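
Another way to avoid the reassembly step altogether is to collect the cut positions with `re.finditer()` and slice the original string. This is a sketch of a swapped-in technique, not the method above: the helper names `cut_positions` and `cut_at` are made up for illustration, and the code assumes that group 1 of the regex is the part that should stay on the left of each cut.

```python
import re


def cut_positions(data, regex):
    """Indices where group 1 of each (non-overlapping) match ends."""
    return [m.end(1) for m in re.finditer(regex, data)]


def cut_at(data, positions):
    """Slice data at the given ascending indices."""
    bounds = [0] + list(positions) + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


data = "ABCDEFGHIJKLMNOCDEFGZYPE"
positions = cut_positions(data, r"(CDE)(FG)")
print(positions)                # [5, 18]
print(cut_at(data, positions))  # ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
```

Since only match offsets are used, extra capturing groups inside the regex do not change the shape of the result, as long as group 1 still ends at the cut point.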
alexis
1

A safer non-regex solution could be this:

def split(string, pattern):
    """Split the given string in the place indicated by a pipe (|) in the pattern"""
    safe_splitter = "#@#@SPLIT_HERE@#@#"
    safe_pattern = pattern.replace("|", safe_splitter)
    string = string.replace(pattern.replace("|", ""), safe_pattern)
    return string.split(safe_splitter)

s = "ABCDEFGHIJKLMNOCDEFGZYPE"
print(split(s, "CDE|FG"))
print(split(s, "|FG"))
print(split(s, "FGH|"))

https://repl.it/C448

Urban