How to account for special characters when splitting a string in Python

Question

I have several strings of the form "[one or more words] [number] [one or more words]" and I want to split them into the two strings and the number. For example, if the string is:

"A sample string 20 something"

I want to obtain:

str1 = "A sample string"
numb = 20
str2 = 'something'

I have (almost) achieved my goal with the following code:

for s in row.split():
    if s.isdigit():
        quants = s
temp = row.split("{}".format(quants))
str1 = temp[0].strip()
str2 = temp[1].strip()

This works fine for most cases. However, there are two exceptions that I'm not able to handle:

1. If the number is inside brackets, I want it to be counted as a string. For example:

"Some text (just as 1 example) 2 more words"

I want to have str1 = "Some text (just as 1 example)"

2. Sometimes the number is expressed as a special character (Unicode?): ¼, ½ or ¾. How do I account for these?

I suspect the answer is to use regular expressions rather than a delimiter, but I have not really grasped how to use them yet.

Some remarks: 1) `split()` will not help you; your solution just works for single-digit numbers 2) The requirement to consider numbers in brackets as text requires a different approach. 3) With a proper encoding statement, you can include fractions the same way in code as in your question. — guidot, Jun 24 '20 at 13:48
1) actually my solution works also with numbers that have more than 1 digit, as long as there are no spaces in between them, which satisfies the characteristics of the strings I have to deal with. 2) what approach? 3) what do you mean by encoding statement? Those are not just fractions, but unicode(?) characters (i.e. ¼ not 1/4) — gicanzo, Jun 24 '20 at 15:24
Sorry, I've edited my answer to fit your needs better, except for the special characters, do you mean `½` = 1/2? — Hagi, Jun 24 '20 at 19:42

score 1 · Accepted Answer · answered Jun 24 '20 at 20:45

You might use a regex with 3 capturing groups, and then get the values of the groups.

^(\w+(?: \w+)*(?: \([^()]*\))?) (\d+|[¼½¾]) (\w+(?: \w+)*)$

Explanation

^ Start of string
( Capture group 1
- \w+(?: \w+)* Match 1+ word chars, optionally followed by repetitions of a space and 1+ word chars
- (?: \([^()]*\))? Optionally match a space followed by a parenthesised part, from the opening to the closing parenthesis
) Close group 1, then match a space
(\d+|[¼½¾]) Capture group 2: match 1+ digits or one of the listed characters ¼½¾, then match a space
(\w+(?: \w+)*) Capture group 3: match 1+ word chars, optionally followed by repetitions of a space and 1+ word chars
$ End of string
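
The breakdown above can also be written directly into the pattern. This is only a sketch of the same regex using `re.VERBOSE`; the name `PATTERN` is made up for illustration, and literal spaces are written as `[ ]` because verbose mode ignores unescaped whitespace.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# Unescaped whitespace is ignored in verbose mode, so each literal
# space is written as the character class "[ ]".
PATTERN = re.compile(r"""
    ^
    (                       # group 1: leading text
        \w+ (?:[ ]\w+)*     #   words separated by single spaces
        (?:[ ]\([^()]*\))?  #   optional space plus a parenthesised part
    )
    [ ]
    (\d+ | [¼½¾])           # group 2: digits or one fraction character
    [ ]
    (\w+ (?:[ ]\w+)*)       # group 3: trailing words
    $
""", re.VERBOSE)

m = PATTERN.match("A sample string 20 something")
if m:
    print(m.groups())
```

For the sample string from the question this should give the three parts "A sample string", "20" and "something".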


Example code

import re

regex = r"(\w+(?: \w+)*(?: \([^()]*\))?) (\d+|[¼½¾]) (\w+(?: \w+)*)"
s = "Some text (just as 1 example) ¼ more words"
match = re.match(regex, s)
if match:
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))

Output

Some text (just as 1 example)
¼
more words
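
The question asks for the number as a value (numb = 20), while group 2 is still a string. One possible way to convert it, assuming the fraction characters should become floats, is `unicodedata.numeric` from the standard library. The helper name `split_row` is hypothetical, and `re.fullmatch` is used here in place of the `^` and `$` anchors.

```python
import re
import unicodedata

regex = r"(\w+(?: \w+)*(?: \([^()]*\))?) (\d+|[¼½¾]) (\w+(?: \w+)*)"

def split_row(row):
    """Return (str1, numb, str2), or None if the row does not match."""
    match = re.fullmatch(regex, row)
    if match is None:
        return None
    str1, raw, str2 = match.groups()
    # int() handles runs of digits; unicodedata.numeric() maps a single
    # character such as "¼" to its numeric value as a float.
    numb = int(raw) if raw.isdigit() else unicodedata.numeric(raw)
    return str1, numb, str2

print(split_row("A sample string 20 something"))
print(split_row("Some text (just as 1 example) ¼ more words"))
```

Whether a float like 0.25 is the right representation depends on what you do with the number afterwards; `fractions.Fraction` would be an alternative if exact values matter.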

A slightly broader pattern uses .* (any character except a newline) instead of \w+:

^(.*(?:\([^()]*\))?) (\d+|[¼½¾]) (.+)
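
A quick way to see how the broader pattern behaves is to run it on a few rows. The sample strings below, apart from the one from the question, are invented for illustration.

```python
import re

broad = re.compile(r"^(.*(?:\([^()]*\))?) (\d+|[¼½¾]) (.+)")

samples = [
    "Salt & pepper, ground ½ tsp.",                # punctuation that \w+ would reject
    "Some text (just as 1 example) 2 more words",  # string from the question
    "Buy 2 packs of 6 eggs",                       # two candidate numbers
]

for s in samples:
    m = broad.match(s)
    print(m.groups() if m else None)
```

Note that .* is greedy, so the split happens at the last number that is surrounded by spaces. In the third sample that is 6, not 2. For the same reason, a bracketed number that comes after the intended one (for example "Mix 2 eggs (about 1 cup) now") would be picked up as the number, so this variant only suits rows where the last standalone number is the one you want.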


Perfect! Also the regex demo tool is amazing. Thank you very much! — gicanzo, Jun 25 '20 at 23:02

Hagi · Answer 2 · 2020-06-24T19:39:50.330

I adapted an existing answer, changing its regex so that it splits around numbers:

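import re
test = "A sample string 20 something"
# Non-capturing group, so the empty matches are not added to the result (needs Python 3.7+)
l = re.compile(r"(?:(?<=\d)(?=\D)|(?=\d)(?<=\D))(?![^\(]*\))").split(test)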

gave me this:

['A sample string ', '20', ' something']

Tested here: https://regex101.com/r/zT2dF9/53

The alternation (?<=\d)(?=\D)|(?=\d)(?<=\D) can be read in two parts: (?<=\d)(?=\D) matches a position that is preceded by a digit (the lookbehind (?<=\d)) and followed by a non-digit (the lookahead (?=\D)).

And vice versa, (?=\d)(?<=\D) matches a position that is preceded by a non-digit and followed by a digit.

This is followed by (?![^\(]*\)), which rejects positions inside brackets.
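
As a rough check of the first exception from the question, the lookarounds described above can be run on the bracketed example. This sketch assumes Python 3.7 or later (splitting on empty matches) and uses a non-capturing group; the names `boundary`, `row` and `parts` are only illustrative.

```python
import re

# Split at digit/non-digit boundaries, except where a closing bracket
# follows without an opening bracket in between (i.e. inside brackets).
boundary = re.compile(r"(?:(?<=\d)(?=\D)|(?=\d)(?<=\D))(?![^(]*\))")

row = "Some text (just as 1 example) 2 more words"
parts = [p.strip() for p in boundary.split(row)]
print(parts)
```

This should leave the 1 inside the brackets alone and split only around the 2. It does not cover the second exception: \d does not match ¼, ½ or ¾, so a row containing only one of those characters is returned unsplit.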

For detailed information about the regex literals, check this out: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

would you be so kind to explain this? My code can already decompose a string in this way. Can this code handle the two exceptions I mentioned above? — gicanzo, Jun 24 '20 at 15:21
