Split String based on multiple Regex matches

Question

First of all, I checked these previous posts, and they did not help me: 1, 2, 3.
I have this string (or a similar one) that needs to be handled with regex:

"Text Table 6-2: Management of children study and actions"

What I am supposed to do is:
1. Detect the word Table and the word(s) before it, if any.
2. Detect the numbers that follow, which can be in one of these formats: 6, 6-2, 66-22 or 66-2.
3. Capture the rest of the string (in this case: Management of children study and actions).

After doing so, the return value must be like this:

Return 1 and 2 as one string, and the rest as another string,
e.g. the returned value must look like this: Text Table 6-2, Management of children study and actions

Below is my code:

import re

mystr = "Text Table 6-2:    Management of children study and actions"


if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
    print("True matched")
    parts_of_title = re.search("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr)
    print(parts_of_title)
    print(" ".join(parts_of_title.group().split()[0:3]), parts_of_title.group().split()[-1])

The first requirement returns true as it should, but the second does not. So I changed the code to use compile, but then the regex behaviour changed. The code now looks like this:

mystr = "Text Table 6-2:    Management of children study and actions"


if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
    print("True matched")
    parts_of_title = re.compile("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?").split(mystr)
    print(parts_of_title)

Output:

True matched
['', 'Text ', 'Table', '-2', ':\tManagement of children study and actions']

So based on this, how can I achieve this while keeping the code clean and readable? And why does using compile change the matching?

The fourth bird · Accepted Answer · 2022-03-18T15:04:25.213

The matching changes because:

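In the first part, you call .group().split(), where .group() returns the full match as a string, so .split() is the plain str.split() that splits on whitespace.
In the second part, you call re.compile("...").split(), where re.compile returns a regular expression object, so .split() is the pattern's own split method: it cuts the string at every match and also puts the text of the capture groups into the result.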
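The difference can be reproduced with a minimal sketch that reuses the pattern and string from the question (the variable names full_match, str_parts and regex_parts are only illustrative):

```python
import re

# Pattern and input copied from the question.
pattern = r"([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?"
mystr = "Text Table 6-2:    Management of children study and actions"

# str.split(): the full match is a plain string, split on whitespace.
full_match = re.search(pattern, mystr).group()
str_parts = full_match.split()

# Pattern.split(): the input is cut at each match, and the text of every
# capture group is inserted into the result list.
regex_parts = re.compile(pattern).split(mystr)

print(str_parts)
print(regex_parts)
```

In the second list, the '-2' entry is the text of the third capture group; the leading 6 is missing because it sits outside any group.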

In the pattern, the part [a-zA-Z0-9]+[ ] matches only a single word before the keyword. If the number should be in a capture group, note that in [0-9]([-][0-9]+)? the first (single) digit is currently not part of the capture group.
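Both limitations are easy to check. The sketch below assumes two extra test inputs that are not in the question, and a hypothetical variant of the pattern (named fixed here) that moves the whole number into one group:

```python
import re

original = r"([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?"

# Two words before the keyword: re.match anchors at the start and the
# optional prefix covers one word only, so there is no match.
two_words = re.match(original, "Some Text Table 6-2: Title")

# Two leading digits: only the first digit is consumed and the optional
# group does not take part in the match.
m = re.match(original, "Text Table 66-22: Title")
truncated = m.group()      # the match stops after the first digit
hyphen_part = m.group(3)   # the optional group did not match

# Hypothetical fix: one or more digits, captured together with the suffix.
fixed = r"([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ]([0-9]+(?:-[0-9]+)?)"
number = re.match(fixed, "Text Table 66-22: Title").group(3)

print(two_words, truncated, hyphen_part, number)
```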

You could write the pattern using 4 capture groups:

^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)

See a regex demo.

import re

pattern = r"^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)"
s = "Text Table 6-2:    Management of children study and actions"
m = re.match(pattern, s)
if m:
    print(m.groups())

Output

('Text ', 'Table', '6-2', 'Management of children study and actions')

If you want points 1 and 2 as one string, you can use 2 capture groups instead:

^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)

Regex demo

The output will be

('Text Table 6-2', 'Management of children study and actions')
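To get the exact return value asked for in the question, the two-group pattern can be wrapped in a small helper. This is only a sketch: the function name split_title and the choice to return None when nothing matches are assumptions, not part of the answer.

```python
import re

# Two-group pattern from the answer, compiled once for reuse.
TITLE_RE = re.compile(
    r"^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)"
)

def split_title(text):
    """Return (label, title), or None when text is not a numbered caption."""
    m = TITLE_RE.match(text)
    return m.groups() if m else None

result = split_title("Text Table 6-2:    Management of children study and actions")
if result:
    # prints: Text Table 6-2, Management of children study and actions
    print(", ".join(result))
```

Returning None keeps the "did it match?" check and the extraction in one call, so the pattern does not need to be written twice as in the question.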

Is there a way for me to learn the regex as you wrote? @The fourth bird — Ahmad, Mar 25 '22 at 09:17
@Ahmad There are some very informative sites like https://www.rexegg.com/regex-quickstart.html and https://www.regular-expressions.info/ — The fourth bird, Mar 25 '22 at 09:21

LaM0uette · Answer 2 · 2022-03-18T16:00:07.360


You already have answers, but I wanted to try your problem as practice, so here is what I found in case you are interested:

((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure)).*?((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)

And here is the link to my tests: https://regex101.com/r/7VpPM2/1
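A minimal sketch of how this pattern could be used from Python, assuming re.search and the sample string from the question. The pattern consumes only one space after the colon, so the title is stripped afterwards:

```python
import re

pattern = r"((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure)).*?((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)"
s = "Text Table 6-2:    Management of children study and actions"

m = re.search(pattern, s)
if m:
    label, number, title = m.groups()
    # Extra whitespace after ": " ends up in the last group, so strip it.
    print(f"{label} {number}", title.strip(), sep=", ")
```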

edited Mar 18 '22 at 16:00

answered Mar 18 '22 at 15:03

LaM0uette
