Split a string with different condition without removing the character in python

Question

I have a string with parameters in it:

text =  "Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]"

I want to split it on the spaces that separate the parameters, so that each parameter becomes its own list element:

pred_res = ["Uncertain significance","PVS1=0","PS=[0, 0, 0, 0, 0]","PM=[0, 0, 0, 0, 0, 0, 0]","PP=[0, 0, 0, 0, 0, 0]","BA1=0","BS=[0, 0, 0, 0, 0]","BP=[0, 0, 0, 0, 0, 0, 0, 0]"]

So far I have used this regex pattern:

pat = re.compile(r'[a-z]\s[A-Z]|[0-9]\s[A-Z]|]\s[A-Z]')

But splitting on it also removes the characters on either side of each space:

res = ["Uncertain significanc","VS1=","S=[0, 0, 0, 0, 0","M=[0, 0, 0, 0, 0, 0, 0","P=[0, 0, 0, 0, 0, 0","A1=","S=[0, 0, 0, 0, 0","P=[0, 0, 0, 0, 0, 0, 0, 0]"]

So is there a way to prevent this and obtain the result shown in pred_res?

So you want (list of words) OR (XX=[...])? Also, you didn't show which regex method you used with your `pat` pattern. – azro Apr 27 '21 at 11:00
@azro I want the result to look like **pred_res**. I used the pattern in Series.str.split(), because I have a column with data like the variable text; I just showed a single example. – Nikhil Panchal Apr 27 '21 at 11:07

score 4 · Accepted Answer · answered Apr 27 '21 at 11:04

You can split on each space that is immediately followed by word characters and an =, using a lookahead so that only the space itself is consumed.

import re
text = 'Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]'
pred_res = re.split(r' (?=\w+=)', text)
print(pred_res)
# ['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']
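
The question mentions a whole column of such strings. As a minimal sketch using only the standard library and a plain list (the second row and the names `rows`, `splitter` and `split_rows` are made up for illustration), the same pattern can be compiled once and reused:

```python
import re

# Hypothetical rows standing in for a column of values; the second one is invented.
rows = [
    "Uncertain significance PVS1=0 PS=[0, 0]",
    "Benign BA1=1 BS=[1, 0]",
]

# Split on a space only when it is followed by word characters and '='.
splitter = re.compile(r" (?=\w+=)")

split_rows = [splitter.split(row) for row in rows]
for parts in split_rows:
    print(parts)
# ['Uncertain significance', 'PVS1=0', 'PS=[0, 0]']
# ['Benign', 'BA1=1', 'BS=[1, 0]']
```

The spaces inside the bracketed lists are left alone because what follows them (for example `0]`) never contains an `=` right after the word characters.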

Nick is tired


Thank you so much, it worked like magic. If it's not too much trouble, could you explain how you came up with this pattern? – Nikhil Panchal Apr 27 '21 at 11:22
@NikhilPanchal There's a brief description of lookaheads here: [Regex lookahead, lookbehind and atomic groups](https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups), but ultimately they're just something you learn at some point. The one used here lets you match a string only when another string follows it (*"Look ahead positive `(?=)`"* in the linked post). I went for a space followed by text containing = because that is where all the splits occurred in your example string. – Nick is tired Apr 27 '21 at 11:28
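
To make the difference concrete, here is a small sketch on a shortened version of the example string (the names `consuming` and `lookahead` are only for illustration). The question's pattern matches the characters around the space, so `re.split` throws them away together with the space; the lookahead pattern matches the space alone.

```python
import re

text = "Uncertain significance PVS1=0 PS=[0, 0] BA1=0"

# Consuming pattern from the question: the character before the space and the
# capital letter after it are part of the match, so both are discarded.
consuming = re.split(r"[a-z]\s[A-Z]|[0-9]\s[A-Z]|]\s[A-Z]", text)
print(consuming)
# ['Uncertain significanc', 'VS1=', 'S=[0, 0', 'A1=0']

# Lookahead pattern: (?=\w+=) only checks what follows; the match is the space.
lookahead = re.split(r" (?=\w+=)", text)
print(lookahead)
# ['Uncertain significance', 'PVS1=0', 'PS=[0, 0]', 'BA1=0']
```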

The fourth bird · Answer 2 · 2021-04-27T11:15:28.750

Another option is to match each of the separate parts instead of splitting.

\w+=(?:\[[^][]*]|[^][\s]+)|\w+(?: \w+)*(?= \w+=|$)

\w+= Match 1+ word chars followed by =
(?: Non-capturing group
- \[[^][]*] Match from [ to the next ], with no other brackets in between
- | Or
- [^][\s]+ Match 1+ chars other than a whitespace char, [ and ]
) Close the group
| Or
\w+(?: \w+)*(?= \w+=|$) Match word chars, optionally followed by repeated groups of a space and word chars, then assert that what follows on the right is either a space, word chars and =, or the end of the string
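
The breakdown above can also be written into the pattern itself with `re.VERBOSE`. This is a sketch of the same regex, not a different technique; the only change is that literal spaces are written as `[ ]`, because verbose mode ignores unescaped whitespace outside character classes. The shortened input string and the names `pattern` and `parts` are just for illustration.

```python
import re

pattern = re.compile(
    r"""
    \w+=                  # a key: one or more word characters, then '='
    (?:
        \[[^][]*]         # bracketed value: '[' up to the next ']', no brackets inside
      |
        [^][\s]+          # or a bare value: no whitespace and no brackets
    )
  |
    \w+(?:[ ]\w+)*        # free text: words separated by single spaces
    (?=[ ]\w+=|$)         # that must be followed by ' key=' or the end of the string
    """,
    re.VERBOSE,
)

s = "Uncertain significance PVS1=0 PS=[0, 0, 0] BA1=0"
parts = pattern.findall(s)
print(parts)
# ['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0]', 'BA1=0']
```

Because the pattern has no capturing groups, `re.findall` returns the full text of each match.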

import re

s = "Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]"
pattern = r"\w+=(?:\[[^][]*]|[^][\s]+)|\w+(?: \w+)*(?= \w+=|$)"

pred_res = re.findall(pattern, s)
print(pred_res)

Output

['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']

score 1 · Answer 3 · answered Apr 27 '21 at 22:29

Use

\s+(?=[A-Z])

EXPLANATION

--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
  )                        end of look-ahead

Python code:

import re
test_str = 'Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]'
matches = re.split(r'\s+(?=[A-Z])', test_str)
print(matches)

Results:

['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']
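
One thing to keep in mind with this pattern: it splits before any capital letter, not only before a parameter name. The sketch below uses a made-up label with two capitalized words (it does not appear in the question) to show the difference from a lookahead that also requires an =. The names `label_text`, `by_capital` and `by_key` are hypothetical.

```python
import re

# Hypothetical input: the free-text label contains a second capitalized word.
label_text = "Likely Pathogenic PVS1=1 PS=[0, 1]"

# Splitting before any capital letter also breaks the label apart.
by_capital = re.split(r"\s+(?=[A-Z])", label_text)
print(by_capital)
# ['Likely', 'Pathogenic', 'PVS1=1', 'PS=[0, 1]']

# Requiring word characters followed by '=' keeps the label in one piece.
by_key = re.split(r"\s+(?=\w+=)", label_text)
print(by_key)
# ['Likely Pathogenic', 'PVS1=1', 'PS=[0, 1]']
```

If the labels in the real data never contain a capitalized word after the first one, both patterns give the same result, as in the example above.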

That's one beautiful explanation :o – Nikhil Panchal Apr 28 '21 at 11:19
