Sample text:

"115 There was objective evidence to establish that the enactment of national laws for various mandatory IMO instruments and their amendments (including tacit amendments) were subject to delays and there was a lack of established procedures and commitment from relevant authorities to assist the process (SOLAS 1974, article I; MARPOL, article 1; LL 1966, article 1; III Code, paragraph 4; III Code, paragraph 8; III Code, paragraph 11)."

I want to extract:

"SOLAS 1974, article I; MARPOL, article 1; LL 1966, article 1; III Code, paragraph 4; III Code, paragraph 8; III Code, paragraph 11"

I have used the pattern r'\((.*III .*)\)' with re.findall, but this returns:

"(including tacit amendments) were subject to delays and there was a lack of established procedures and commitment from relevant authorities to assist the process (SOLAS 1974, article I; MARPOL, article 1; LL 1966, article 1; III Code, paragraph 4; III Code, paragraph 8; III Code, paragraph 11)"

Any ideas? This is driving me crazy!

  • How about using `re.findall(r'\(.*?\)', s)`, assuming `s` is the variable that holds the text? – Meh Dec 10 '19 at 20:00
  • Use `\([^)]*\)` - it's faster than the lazy quantifier approach suggested by @user2977071 - but same idea. – ctwheels Dec 10 '19 at 20:03
  • Is it always going to be at the end of the text? – Peter Wood Dec 10 '19 at 20:04
  • @ctwheels never knew they have performance difference, thanks for sharing your knowledge :) – Meh Dec 10 '19 at 20:04
  • The expression `.*III .*` should match the whole of any string containing `III` –  Dec 10 '19 at 20:04
  • @user2977071 `.*?` backtracks whereas `[^)]*` doesn't :) – ctwheels Dec 10 '19 at 20:05
  • @ctwheels you should convert your answer-comment into an answer – Cireo Dec 10 '19 at 20:07
  • Use the `\(.*?\)` it is the fastest, if that is a concern. –  Dec 10 '19 at 20:09
  • `Regex1: \([^)]*\) Completed iterations: 50 / 50 ( x 1000 ) Matches found per iteration: 2 Elapsed Time: 0.55 s, 550.11 ms, 550106 µs Matches per sec: 181,783 Regex2: \(.*?\) Completed iterations: 50 / 50 ( x 1000 ) Matches found per iteration: 2 Elapsed Time: 0.39 s, 389.40 ms, 389404 µs Matches per sec: 256,802` –  Dec 10 '19 at 20:15
  • @x15 not sure where you're getting those figures from, but here: https://tio.run/##bVLBbtswDD3PX8Fb7CIwZrTruhXFUPRkIEWGdbd1BRiJjtnKkkHRCbKfz@S46BY0ugh67@npkVS/0zb48/2euz6IgnJHrHMQyrIINzCrqk/wsyUh2GKEsHomo7whoA1b8oZAA1BUXDmOLWiLyaNNtEejHXmF0IBH5eDRgcNthCYIbFA4DBE69BY1yA7q@yWwjyrDeCtCIkYjFsAE2AnM2Rs3WPZrUDSs/3EFbMeQcTgkHFNZcrh7M3otAFMG8zKGegtNFnoJhuwgNMlN6Dqe0jcSutQMRxtMJxxSt4SVkzC9gDFynOo9OMSU8GG5uH2A6svnizmgKBtHUF/D/e2P78vFP6i6hsUiyS4vj7C6ruEuWJpDj4Jrwb6Fi9Pw1Wm4qopylmVrScMTKlMpPTvKZfaY/3oqfp89FrMic@/Z8uzbgcosNbAWIrvLi6/ZByEdxCekbDjNyrk8vooc/jmSuGNJL@w1n/5TOW355DsHP3Qrkpvq47iK4qR2tH@n3O//Ag – ctwheels Dec 10 '19 at 20:22
  • Thank you for your really useful comments: – John Russell Dec 10 '19 at 20:22
  • @x15 apologies, typo at that URL, fixed [here](https://tio.run/##bVLBjtMwED2Tr5hbk1UVEbEsC6sVWnGK1FURy40FaWpPmgHHjsaTVuXni9OUQrX1xfJ7b968sd3vtA3@zX7PXR9EQbkj1jkIZVmEe5hV1Vv42pIQbDFCWP0ko7whoA1b8oZAA1BUXDmOLWiLyaNNtEejHXmF0IBH5eDRgcNthCYIbFA4DBE69BY1yA7qxyWwjyrDWBUhEaMRC2AC7ATm7I0bLPs1KBrW/7gCtmPIOBwSjqksOdydjI4DYMpgfo2hTqHJQi/BkB2EJrkJXcdT@kZCly7D0QbTCYd0W8LKSZg6YIwcp3kPDjElfFouHp6gev/ueg4oysYR1Hfw@PDl83LxD6ruYLFIspubM6yua/gULM2hR8G1YN/C9WX49jJcVUU5y7K1pMcTKtMoPTvKZfacf/tRfL96LmZF5i6w5dXHA5dZamAtRHaXFx@yV0I6iE9I2XB6LOfyeBQ5/H0mceeSXthr/tepOJ6nD1ZO25Gdgx@6Fcl99Xpcxal46nC5dOReFO73fwA) – ctwheels Dec 10 '19 at 20:35

1 Answer

It's unclear if you want to only match parentheses with III within them. In any case, I'll provide solutions with and without that check below.


Extract text between parentheses

See this regex in use here.

\([^)]*\)

How it works:

  • \( matches a literal (
  • [^)]* matches any character except ), zero or more times
  • \) matches a literal )
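
As a minimal sketch of how this pattern behaves with Python's re module, the snippet below runs it on a shortened version of the sample text (the `text` variable and the shortened wording are assumptions made for illustration). A second pattern adds a capture group so that findall returns the contents without the surrounding parentheses.

```python
import re

# Shortened stand-in for the sample text in the question (assumption).
text = (
    "instruments and their amendments (including tacit amendments) "
    "were subject to delays "
    "(SOLAS 1974, article I; MARPOL, article 1; III Code, paragraph 4)"
)

# Whole match, parentheses included.
with_parens = re.findall(r"\([^)]*\)", text)

# Capture group: findall returns only what is inside the parentheses.
inner = re.findall(r"\(([^)]*)\)", text)

print(with_parens)
print(inner)
```

Note that this returns every parenthesised group, including "(including tacit amendments)", so a further filter is needed if only the list of instruments is wanted.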

Extract text between parentheses if it contains III

See this regex in use here.

\([^)]*I{3}[^)]*\)

Same logic as before, but it also requires III (written as I{3}) to appear somewhere between the parentheses.
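
A short sketch of this variant in Python, again on a shortened stand-in for the sample text (the `text` variable is an assumption). A capture group is added so the result matches the string the question asks for, without the outer parentheses.

```python
import re

# Shortened stand-in for the sample text in the question (assumption).
text = (
    "instruments and their amendments (including tacit amendments) "
    "were subject to delays "
    "(SOLAS 1974, article I; MARPOL, article 1; III Code, paragraph 4)"
)

# Only groups that contain three consecutive I characters are matched.
iii_groups = re.findall(r"\(([^)]*I{3}[^)]*)\)", text)

print(iii_groups)
```

The first group is skipped because [^)]* cannot cross its closing parenthesis, so III is never found inside it.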


Performance

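In the second example, I{3} matches I exactly three times, which is equivalent to writing the literal III. Any speed difference between the two forms depends on the regex engine and is unlikely to matter here.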

It was also mentioned that .*? can be used to replace [^)]*. That is true, but there can be a performance cost: the lazy .*? extends the match one character at a time and checks after each step whether \) can match. The negated character class stays greedy and consumes everything up to the next ) in a single pass, which usually means less work, although the actual difference depends on the engine and the input.
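
Since the comments report measurements pointing in both directions, it is worth timing both patterns on your own data. The following is a rough sketch using the standard timeit module; the sample `text` and the iteration count are arbitrary assumptions, and the numbers will vary by machine and Python version.

```python
import re
import timeit

# Shortened stand-in for the sample text in the question (assumption).
text = (
    "instruments and their amendments (including tacit amendments) "
    "were subject to delays "
    "(SOLAS 1974, article I; MARPOL, article 1; III Code, paragraph 4)"
)

negated_class = re.compile(r"\([^)]*\)")
lazy_dot = re.compile(r"\(.*?\)")

# Both patterns should find the same groups on text without nesting.
same_matches = negated_class.findall(text) == lazy_dot.findall(text)

# Time 10,000 runs of each pattern (iteration count chosen arbitrarily).
negated_seconds = timeit.timeit(lambda: negated_class.findall(text), number=10_000)
lazy_seconds = timeit.timeit(lambda: lazy_dot.findall(text), number=10_000)

print(f"same matches: {same_matches}")
print(f"[^)]* : {negated_seconds:.4f} s")
print(f".*?   : {lazy_seconds:.4f} s")
```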

You can check this performance comparison here.

ctwheels
  • Thanks for the reply! – John Russell Dec 10 '19 at 20:43
  • I will test the answers and update post when I hopefully get this resolved. – John Russell Dec 10 '19 at 21:01
  • I used `\([^)]*I{3}[^)]*\)` and it worked except for *(SOLAS 1974, article I; MARPOL, article 1; MARPOL, article 6(3); LL 1966, article 1; TONNAGE 1969, article 1; COLREG 1972, article I; III Code, paragraph 8)* - see regex example [HERE](https://regex101.com/r/lKdGqm/8). It appears 6(3) is causing the problem. When you change it to 6(3 the problem goes away. Any suggestions? – John Russell Dec 11 '19 at 00:39
  • @JohnRussell yes exactly, because it stops at the first `)`. How many nested parentheses can your sentences contain? In this case you have one nested parenthesis. Is there a hard limit? If not, you'll need to use a different Python package (PyPI regex), as Python's re package doesn't support variable-length lookbehinds or recursion, which are necessary for matching nested structures. – ctwheels Dec 11 '19 at 02:37
  • Thank you for all your advice and comments. I am struggling to write a regex to locate (*MARPOL, article 4(4); III Code, paragraph 8*) to (*, article 4(4); III Code, paragraph 8*) and (*MARPOL, article 4(4); III Code, paragraph 8*) to (*MARPOL, article 4(4)(2); III Code, paragraph 8*). – John Russell Dec 12 '19 at 19:29
  • @JohnRussell using the `regex` library instead of `re`, you can use `r'(\((?:(?1)|[^()])*\))(?<=I{3}.*)'` as seen [here](https://tio.run/##ndC7CsJAEAXQOvsV02VGQvBVSFSCWAUUxdYHLHFNFrLZZRIhon57NNjaaDdc7rnFuFud23LUtto4yzWwylQjBMP8c4apNU4XCtnHA2IcYTygx/6EdKTegQjj2Ty5j55hj3wS1ZvthefjerHbblYBSK51WigY45imkCQJLO1ZBeAky4yly2FCftCJX7q/r38TOPwffTPiKC6WoQFdQhUJz3Q/DCslOc2xIeHpC5h37jnWZY0mzNheHfaJ2vYF) – ctwheels Dec 12 '19 at 20:34
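
If installing the third-party `regex` package is not an option, nested parentheses can also be handled without a regular expression by counting nesting depth. The sketch below is a different technique from the recursive pattern in the comment above, not a translation of it; the function name `top_level_groups` and the `sample` string are hypothetical, chosen for illustration.

```python
def top_level_groups(text):
    """Return the contents of each outermost (...) group, nested parentheses included."""
    groups = []
    depth = 0
    start = None
    for index, char in enumerate(text):
        if char == "(":
            if depth == 0:
                start = index + 1  # first character after the outer "("
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(text[start:index])
    return groups


# Hypothetical sample with one level of nesting, as in "article 6(3)".
sample = "delays (tacit) and gaps (MARPOL, article 6(3); III Code, paragraph 8)"

all_groups = top_level_groups(sample)
with_iii = [group for group in all_groups if "III" in group]

print(with_iii)
```

Unbalanced closing parentheses are ignored, and an unclosed opening parenthesis produces no group; adjust that behaviour if your input needs stricter validation.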