
I am trying to parse text from a document using regex. The document contains different section structures, e.g. section 1.2 and section (1). The regex below parses sections numbered with a decimal point but fails for sections numbered in parentheses.

Any suggestions for handling content that starts with a parenthesized number such as (1)?

For example:

import re
RAW_Data = '(4) The Governor-General may arrange\n with the Chief Minister of the Australian Capital Territory for the variation or revocation of an \n\narrangement in force under subsection (3). \nNorthern Territory \n (5) The Governor-General may make arrangements with the \nAdministrator of the Northern \nTerritory with respect to the'

f = re.findall(r'(^\d+\.[\d\.]*)(.*?)(?=^\d+\.[\d\.]*)', RAW_Data,re.DOTALL|re.M|re.S)
for z in f:
    z=(''.join(z).strip().replace('\n',''))
    print(z)

Expected output:

(4) The Governor-General may arrange with the Chief Minister of the Australian Capital Territory for the variation or revocation of an arrangement in force under subsection

(3) Northern Territory

(5) The Governor-General may make arrangements with the Administrator of the Northern Territory with respect to the'

data_nerd

3 Answers

0

You can try:

(?<=(\(\d\)|\d\.\d))(.(?!\(\d\)|\d\.\d))*

To understand how it works, consider the following block:

(\(\d\)|\d\.\d)

It looks for strings of the form (X) or X.Y, where X and Y are single digits. Let's call such a string a 'delimiter'.

Now, the regex above looks for the first character preceded by a delimiter (positive lookbehind) and matches the following characters until it reaches one that is followed by a delimiter (negative lookahead).
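As a rough illustration of the delimiter idea, here is a minimal Python sketch run on the RAW_Data string from the question. It departs from the pattern above in several ways, all of which are assumptions of this sketch: the delimiter is matched as part of the result instead of sitting in a lookbehind (so the section number is kept), the groups are non-capturing so that re.findall returns whole matches, the lookahead is placed before the dot, and \d+ is used so that multi-digit numbers also count as delimiters.

```python
import re

RAW_Data = '(4) The Governor-General may arrange\n with the Chief Minister of the Australian Capital Territory for the variation or revocation of an \n\narrangement in force under subsection (3). \nNorthern Territory \n (5) The Governor-General may make arrangements with the \nAdministrator of the Northern \nTerritory with respect to the'

# A delimiter is "(N)" or "N.M"; allowing multi-digit numbers is an assumption.
DELIM = r'(?:\(\d+\)|\d+\.\d+)'

# Match a delimiter, then every character that does not start another delimiter.
# re.DOTALL lets "." cross the newlines embedded in the text.
pattern = re.compile(DELIM + r'(?:(?!' + DELIM + r').)*', re.DOTALL)

# Collapse newlines and repeated spaces inside each match.
sections = [' '.join(m.split()) for m in pattern.findall(RAW_Data)]
for s in sections:
    print(s)
```

On this input the cross-reference in "subsection (3)" is also treated as a delimiter, so the text is split there as well.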


Hope it helps!

Neb
  • `(.(?!\(\d\)|\d\.\d))*` should be written as `(?:(?!\(\d\)|\d\.\d).)*`. However, to make the pattern work for OP you need to convert all capturing groups to non-capturing. – Wiktor Stribiżew Oct 08 '18 at 10:57
  • Thanks! Your regex works better than mine. However, I do not understand it. Why does the `.` follow the negative lookahead? Shouldn't it be placed before it? – Neb Oct 08 '18 at 11:05
  • I suggest studying [this answer](https://stackoverflow.com/questions/30900794/tempered-greedy-token-what-is-different-about-placing-the-dot-before-the-negat/37343088#37343088). – Wiktor Stribiżew Oct 08 '18 at 11:09
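
The difference between the two forms discussed in these comments can be seen on a small string. This is only a minimal sketch: the names dot_first and lookahead_first and the sample strings are made up for illustration.

```python
import re

DELIM = r'\(\d\)|\d\.\d'

# Dot first: a character is consumed, then the lookahead checks what follows it.
dot_first = re.compile(r'(?:.(?!' + DELIM + r'))*')
# Lookahead first: each position is checked before its character is consumed.
lookahead_first = re.compile(r'(?:(?!' + DELIM + r').)*')

text = 'ab(1)c'
before_dot_first = dot_first.match(text).group()
before_lookahead_first = lookahead_first.match(text).group()
print(repr(before_dot_first))        # 'a'  (the "b" right before "(1)" is lost)
print(repr(before_lookahead_first))  # 'ab'

# When the text starts with a delimiter, the dot-first form never notices it,
# because its first character is consumed before any check is made.
at_delim_dot_first = dot_first.match('(1)x').group()
at_delim_lookahead_first = lookahead_first.match('(1)x').group()
print(repr(at_delim_dot_first))        # '(1)x'
print(repr(at_delim_lookahead_first))  # ''
```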
0

Use the regex [sS]ection\s*\(?\d+(?:\.\d+)?\)?

The \(?\d+(?:\.\d+)?\)? part matches a number, with or without a decimal part, optionally wrapped in parentheses.
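
A minimal sketch of how this pattern behaves in Python. The sample sentence is made up for illustration; note that the pattern assumes the word "section" precedes each number, which is not the case in the RAW_Data string from the question, and that it matches only the section label, not the text that follows it.

```python
import re

pattern = re.compile(r'[sS]ection\s*\(?\d+(?:\.\d+)?\)?')

# Hypothetical sample text containing the three numbering styles.
text = 'As per section 1.2 and Section (1), see section 7.'

labels = pattern.findall(text)
print(labels)  # ['section 1.2', 'Section (1)', 'section 7']
```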


Nambi_0915
0

Here is another regex: \(\d\)[^(]+

  1. \(\d\) matches any string like (1), (2), (3), ...

  2. [^(]+ matches one or more characters and stops at the next (

    You can test it on Regex101.

However, I wonder whether your text contains special cases like (4) The Governor-General may arrange\n with the Chief Minister of the Austr ... (2) (3). \nNorthern Territory \n. That is a single sentence running from (4) to (2), and my regex cannot match this type of sentence as a whole.
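
A minimal Python sketch of this pattern on the RAW_Data string from the question, including the limitation just described. The second sample string is made up for illustration.

```python
import re

pattern = re.compile(r'\(\d\)[^(]+')

RAW_Data = '(4) The Governor-General may arrange\n with the Chief Minister of the Australian Capital Territory for the variation or revocation of an \n\narrangement in force under subsection (3). \nNorthern Territory \n (5) The Governor-General may make arrangements with the \nAdministrator of the Northern \nTerritory with respect to the'

# [^(] also matches newlines, so no flags are needed; whitespace is collapsed afterwards.
parts = [' '.join(m.split()) for m in pattern.findall(RAW_Data)]
for p in parts:
    print(p)

# Limitation: any "(" ends a match, not only one that opens a section number.
truncated = pattern.findall('(4) The Act (as amended) applies')
print(truncated)  # ['(4) The Act ']
```

The cross-reference "(3)" inside the sentence that begins with (4) starts a new match, so that sentence is split in two.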

Community
KC.