I have several strings of the form "[one or more words] [number] [one or more words]" and I want to split them into the two strings and the number. For example, if the string is:

"A sample string 20 something"

I want to obtain:

str1 = "A sample string"
numb = 20
str2 = 'something'

I have (almost) achieved my goal with the following code:

for s in row.split():
    if s.isdigit():
        quants = s
temp = row.split("{}".format(quants))
str1 = temp[0].strip()
str2 = temp[1].strip()

This works fine for most cases. However, there are two exceptions that I am not able to handle:

  1. If the number is inside brackets, I want it to be counted as a string. For example:

    "Some text (just as 1 example) 2 more words"

    I want to have str1 = "Some text (just as 1 example)"

  2. Sometimes the number is expressed as a special (Unicode?) character such as ¼, ½ or ¾. How do I account for these?

I suspect the answer is to use regular expressions rather than a delimiter, but I have not yet grasped how to use them.

gicanzo
  • Some remarks: 1) `split()` will not help you; your solution just works for single-digit numbers 2) The requirement to consider numbers in brackets as text requires a different approach. 3) With a proper encoding statement, you can include fractions the same way in code as in your question. – guidot Jun 24 '20 at 13:48
  • 1) actually my solution works also with numbers that have more than 1 digit, as long as there are no spaces in between them, which satisfies the characteristics of the strings I have to deal with. 2) what approach? 3) what do you mean by encoding statement? Those are not just fractions, but unicode(?) characters (i.e. ¼ not 1/4) – gicanzo Jun 24 '20 at 15:24
  • Sorry, I've edited my answer to fit your needs better, except for the special characters. Do you mean `½` = 1/2? – Hagi Jun 24 '20 at 19:42

2 Answers

You might use a regex with 3 capturing groups, and then get the values of the groups.

^(\w+(?: \w+)*(?: \([^()]*\))?) (\d+|[¼½¾]) (\w+(?: \w+)*)$

Explanation

  • ^ Start of string
  • ( Capture group 1
    • \w+(?: \w+)* Match 1+ word chars, optionally repeated by space and 1+ word chars
    • (?: \([^()]*\))? Optionally match a space followed by a parenthesized part, from the opening to the closing parenthesis
  • ) Close group 1, then match a space
  • (\d+|[¼½¾]) Capture group 2: match 1+ digits or one of the listed characters ¼, ½ or ¾, then match a space
  • (\w+(?: \w+)*) Capture group 3 Match 1+ word chars, and optionally repeat that preceded by space
  • $ End of string
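
As a rough sketch, the breakdown above can also be written as a commented pattern with re.VERBOSE. The group names str1, numb and str2 are borrowed from the question and are not part of the original pattern. In verbose mode whitespace is ignored, so each literal space is written as [ ].

```python
import re

# Same pattern as above, spread out with comments (re.VERBOSE).
# The group names are illustrative; the original pattern uses numbered groups.
PATTERN = re.compile(r"""
    ^
    (?P<str1>                  # group 1: the leading text
        \w+ (?:[ ]\w+)*        #   words separated by single spaces
        (?:[ ]\([^()]*\))?     #   optional " (...)" part, kept as text
    )
    [ ]                        # the space before the number
    (?P<numb> \d+ | [¼½¾] )    # group 2: digits or one of the fractions
    [ ]                        # the space after the number
    (?P<str2> \w+ (?:[ ]\w+)* )  # group 3: the trailing words
    $
""", re.VERBOSE)

m = PATTERN.match("Some text (just as 1 example) 2 more words")
if m:
    print(m.group("str1"))  # Some text (just as 1 example)
    print(m.group("numb"))  # 2
    print(m.group("str2"))  # more words
```

The number inside the brackets stays in the first group because the parenthesized part is consumed as a whole before the standalone number is looked for.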

Example code

import re

regex = r"(\w+(?: \w+)*(?: \([^()]*\))?) (\d+|[¼½¾]) (\w+(?: \w+)*)"
s = "Some text (just as 1 example) ¼ more words"
match = re.match(regex, s)
if match:
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))

Output

Some text (just as 1 example)
¼
more words
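
The groups come back as strings, while the question asks for numb = 20 as a number. A minimal sketch of one way to convert the middle group, assuming a Fraction is acceptable for ¼, ½ and ¾; the helper names to_number and split_row are made up for this illustration.

```python
import re
import unicodedata
from fractions import Fraction

PATTERN = re.compile(r"(\w+(?: \w+)*(?: \([^()]*\))?) (\d+|[¼½¾]) (\w+(?: \w+)*)")

def to_number(token):
    """Convert the captured number: Fraction for ¼, ½, ¾, otherwise int."""
    if token in ("¼", "½", "¾"):
        # unicodedata.numeric("¼") returns 0.25, which Fraction converts exactly
        return Fraction(unicodedata.numeric(token))
    return int(token)

def split_row(row):
    """Return (str1, numb, str2), or None if the row does not fit the pattern."""
    match = PATTERN.fullmatch(row)
    if match is None:
        return None
    str1, numb, str2 = match.groups()
    return str1, to_number(numb), str2

print(split_row("A sample string 20 something"))
print(split_row("Some text (just as 1 example) ¼ more words"))
```

The first call gives ('A sample string', 20, 'something'); the second gives the bracketed text, Fraction(1, 4) and 'more words'.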

A somewhat broader pattern that uses .* (any character except a newline) instead of \w+:

^(.*(?:\([^()]*\))?) (\d+|[¼½¾]) (.+)
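
To illustrate the difference, here is a small sketch comparing the two patterns on text that contains punctuation. The names NARROW and BROAD and the sample row are assumptions made for this example. Because .* is greedy, the broader pattern splits at the last standalone number in the string.

```python
import re

NARROW = re.compile(r"^(\w+(?: \w+)*(?: \([^()]*\))?) (\d+|[¼½¾]) (\w+(?: \w+)*)$")
BROAD = re.compile(r"^(.*(?:\([^()]*\))?) (\d+|[¼½¾]) (.+)")

row = "Flour, sifted ½ cup"  # hypothetical input containing a comma

print(NARROW.match(row))          # None: the comma is not a word character
print(BROAD.match(row).groups())  # ('Flour, sifted', '½', 'cup')

# The bracketed number is still kept as text by the broader pattern
print(BROAD.match("Some text (just as 1 example) 2 more words").groups())
```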

The fourth bird

I adapted this answer by changing the regex so that it splits at numbers:

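import re
test = "A sample string 20 something"
# Non-capturing group (?:...): a capturing group would add the empty matches to the result (needs Python 3.7+)
l = re.compile(r"(?:(?<=\d)(?=\D)|(?=\d)(?<=\D))(?![^\(]*\))").split(test)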

This gave me:

['A sample string ', '20', ' something']

Tested here: https://regex101.com/r/zT2dF9/53

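The alternation (?<=\d)(?=\D)|(?=\d)(?<=\D) can be read as two parts: (?<=\d)(?=\D) matches the position where a digit (checked by the lookbehind (?<=\d)) is followed by a non-digit (checked by the lookahead (?=\D)).

Conversely, (?=\d)(?<=\D) matches the position where a non-digit is followed by a digit.

This is followed by the negative lookahead (?![^\(]*\)), which rejects any position that is followed by a closing parenthesis with no opening parenthesis in between, so the content in brackets is ignored.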
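
A minimal sketch of how the split behaves on the two exceptions from the question, assuming Python 3.7 or later (where re.split also splits on empty matches). The names BOUNDARY and split_at_number are made up for this illustration.

```python
import re

# Non-capturing group: with a capturing group, re.split would also
# return the (empty) matches as extra list items.
BOUNDARY = re.compile(r"(?:(?<=\d)(?=\D)|(?=\d)(?<=\D))(?![^(]*\))")

def split_at_number(row):
    """Split at digit/non-digit boundaries outside brackets and trim the parts."""
    return [part.strip() for part in BOUNDARY.split(row)]

print(split_at_number("A sample string 20 something"))
# ['A sample string', '20', 'something']
print(split_at_number("Some text (just as 1 example) 2 more words"))
# ['Some text (just as 1 example)', '2', 'more words']
print(split_at_number("A sample string ½ something"))
# ['A sample string ½ something']
```

The bracket case is handled, but ½ is not matched by \d, so the fractions are left unsplit with this approach.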

For more detail on the regex syntax, see: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

Hagi
  • Would you be so kind as to explain this? My code can already decompose a string in this way. Can this code handle the two exceptions I mentioned above? – gicanzo Jun 24 '20 at 15:21