1

So, I have a 300+ page document, and I want to remove all the notes I wrote, which are enclosed within "[(" and ")]". Since I also sometimes nest multiple notes, "[(blah [(blah [(blah)])] )]", I need to make sure I don't just remove "[(blah [(blah [(blah)]".

So, to do that, I am not sure what is most efficient ... and this is a large job. What occurs to me is that I could check to see there aren't two consecutive "[(", with a ".*" between them, and just remove the simple cases of "[(...)]". I hope there is a better way than this, though.

I think the two regex codes I use would be something like "/(?<=[[(])[\s\S]*(?=[(])/gi" and "/(?![\s\S][[(][\s\S][[(]).*/gi". Something like that? I'm sorry, I'm still trying to figure out these things.

Also, can I write a python program to open an OpenOffice (odt) file and edit it? The "open(r'C:\Users\Blah\Documents\Blah.odt', 'rw').read()" will work for that too, right?

  • Does this answer your question? [Regexp to remove nested parenthesis](https://stackoverflow.com/questions/25335183/regexp-to-remove-nested-parenthesis) – Woodford Jan 07 '22 at 17:21
  • Can the number of nesting levels be more than two (where `[(blah [(blah [(blah)])] )]` may be regarded as having two nesting levels)? – Cary Swoveland Jan 07 '22 at 18:16
  • Any number of nesting levels can be present. I sometimes write several notes within notes, lol. – Iain Curtis-Shanley Jan 07 '22 at 19:19
  • @Woodford, the two answers to the question you reference do not use a regular expression (though one uses a simple one as part of a calculation). – Cary Swoveland Jan 07 '22 at 19:19
  • I suggest you edit your question to include that information, then we delete our comments. – Cary Swoveland Jan 07 '22 at 19:20
  • And, @Woodford, what you linked seems promising, although I am writing in Python and not java. But I shall look into trying to implement it as well. – Iain Curtis-Shanley Jan 07 '22 at 19:27

3 Answers

1

Alternatively, you can use pyparsing as well.

import pyparsing as pp

pattern = pp.ZeroOrMore(pp.Regex(r'.*?(?=\[\()') + pp.Suppress(pp.nested_expr('[(', ')]'))) + pp.Regex(r'.*')
pattern = pattern.leave_whitespace()

txt = ''
result = ''.join(pattern.parse_string(txt))
assert result == ''

txt = 'blah'
result = ''.join(pattern.parse_string(txt))
assert result == 'blah'

txt = 'blah\nblah'
result = ''.join(pattern.parse_string(txt))
assert result == 'blah'

txt = '[(blah [(blah [(blah)])] )]'
result = ''.join(pattern.parse_string(txt))
assert result == ''

txt = ' blah [] blah () blah [( blah [] blah () )] blah [[]] blah (()) blah ([]) blah '
result = ''.join(pattern.parse_string(txt))
assert result == ' blah [] blah () blah  blah [[]] blah (()) blah ([]) blah '

txt = 'a[(b[(c)])]d[()]e[(f[(g[(h)]i[(j)])]k[(l[(m)])])n[(o)])]p[(q[(r)]s)]t[(u[(v[(w)]x[(y)]z)])]!'
result = ''.join(pattern.parse_string(txt))
assert result == 'adept!'

* pyparsing can be installed with `pip install pyparsing`.

Note:

If a pair of delimiters is left unbalanced inside a note (for example a[(b[(c)] or a[(b)]c)]), the result is unexpected or an IndexError is raised, so use this with care. (See: Python extract string in a phrase)

quasi-human
  • Glad to see you are making good use of pyparsing! For code converters like this, you could have written this as just `pp.nested_expr('[(', ')]').suppress().transform_string(txt)`. `transform_string` takes a str and returns a str of the parsed/converted text. For other alternatives to `parse_string`, also check out `scan_string` and `search_string`. – PaulMcG Feb 20 '22 at 10:33
  • I checked your method, and I said "Wow". That is a much cooler way than mine! I have to learn pyparsing more deeply. Thank you for sharing the cool technique:) – quasi-human Feb 20 '22 at 13:19
0

Check out this way:

import re

a = '[(blah [(blah [(blah)])] )]'

# Lazy match from '[' to the nearest ']'; nested notes leave ')] )]' behind here.
x = re.compile(r'([\[])(.*?)([\]])')
remove_text = re.sub(x, r'', a)
Dolan
  • Sir, your method seems good and concise, but my knowledge is imperfect, so let me ask for clarification on a point or two. I thought "*" meant "0+" occurrences, right? The question mark that follows it ... marks it as "optional"? You put that in there in case I wrote something like '[()]', correct? And I want to catch spaces, new lines, etc., as well, just on the off chance I had them ... so instead of "*", would "[\s\S]" work better? Thank you for your help! – Iain Curtis-Shanley Jan 07 '22 at 19:23
  • You need to work on regex more to fully understand why and how this works. Here are some useful links: https://stackoverflow.com/questions/3075130/what-is-the-difference-between-and-regular-expressions https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Cheatsheet https://developpaper.com/analyze-the-meaning-of-in-regular-expression/ – Dolan Jan 07 '22 at 19:52
  • Ah! I think I understand. By having it be "reluctant", you would grab the first ")]", correct? Would that then return "[(blah [(blah [(blah)]"? Would, as the link explains, a greedy selection here be better, since my chief objective is to remove all notes? That would select "[(blah [(blah [(blah)])] )]", since greedy works from the end and backtracks for a match. – Iain Curtis-Shanley Jan 07 '22 at 20:03
0

One way is to repeatedly remove (replace with empty strings) the clauses that contain no nested clauses, until a pass makes no replacements. If the maximum nesting depth is n, this takes n+1 iterations. The regular expression to match is as follows:

\[\((?:(?!\[\().)*?\)\]



Consider the string:

begin [(Mary [(had [(a )]lil' [(lamb [(whose [(fleece )])])])])]was [(white [( as )])]snow
      1      2     3    3     3      4       5         5 4 3 2 1    1       2      2 1

As shown, this has five nesting levels. After the first replacement we obtain:

begin [(Mary [(had lil' [(lamb [(whose )])])])]was [(white )]snow
      1      2          3      4        4 3 2 1    1        1

After the second replacement:

begin [(Mary [(had lil' [(lamb )])])]was snow
      1      2          3       3 2 1

After the third replacement:

begin [(Mary [(had lil' )])]was snow
      1      2           2 1

After the fourth replacement:

begin [(Mary )]was snow
      1       1

After the fifth replacement:

begin was snow

After the next attempted replacement:

begin was snow

As no replacements were made at the last step we are finished.
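
Since the asker is working in Python, the loop described above can be sketched roughly as follows. This is a minimal sketch, not a tested tool for a 300-page document: the names `INNERMOST_NOTE` and `strip_notes` are made up for illustration, and the `re.DOTALL` flag is an assumption added so that "." also matches line breaks inside a note.

```python
import re

# Matches one innermost note: "[(" ... ")]" with no "[(" in between.
# re.DOTALL is an assumption here: it lets "." match line breaks too.
INNERMOST_NOTE = re.compile(r'\[\((?:(?!\[\().)*?\)\]', re.DOTALL)


def strip_notes(text):
    """Remove innermost notes repeatedly until a pass replaces nothing.

    Returns the cleaned text and the number of passes that were run.
    """
    passes = 0
    while True:
        text, replaced = INNERMOST_NOTE.subn('', text)
        passes += 1
        if replaced == 0:
            return text, passes


sample = ("begin [(Mary [(had [(a )]lil' [(lamb [(whose [(fleece )])])])])]"
          "was [(white [( as )])]snow")
cleaned, passes = strip_notes(sample)
print(cleaned)  # begin was snow
print(passes)   # 6 (five passes that remove something, one that finds nothing)
```

`re.subn` returns both the new string and the number of substitutions, which gives the stopping condition without comparing strings.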


The regular expression can be broken down as follows.

\[\(        # match '[('
(?:         # begin non-capture group
  (?!\[\()  # negative lookahead asserts that next two chars are not '[('
  .         # match any char
)*?         # end non-capture group and execute zero or more times lazily
\)\]        # match ')]'

The regular expression employs a technique called the tempered greedy token solution.
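
To see that the pattern only ever matches innermost clauses, the breakdown above can be written as a Python `re.VERBOSE` pattern and run against the example string. This is only an illustrative sketch; the name `INNERMOST_NOTE` is hypothetical.

```python
import re

# The same pattern in verbose form, so the breakdown lives inside the regex.
INNERMOST_NOTE = re.compile(r'''
    \[\(          # match the opening delimiter
    (?:           # begin non-capture group
      (?!\[\()    # the next two chars must not open another note
      .           # match any char (line breaks excluded unless re.DOTALL is set)
    )*?           # repeat zero or more times, lazily
    \)\]          # match the closing delimiter
''', re.VERBOSE)

sample = ("begin [(Mary [(had [(a )]lil' [(lamb [(whose [(fleece )])])])])]"
          "was [(white [( as )])]snow")

# Only clauses that contain no other clause are matched on a single pass.
innermost = INNERMOST_NOTE.findall(sample)
print(innermost)  # ['[(a )]', '[(fleece )]', '[( as )]']
```

The lookahead is what stops a match from starting at an outer "[(": as soon as the scan reaches another "[(", the group can consume no further characters and the attempt fails.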

Cary Swoveland
  • Brilliant, sir! How does one learn to gain this level? Working on many projects, or through books ... ? But if I may ask, since I want to account for possible line breaks etc, I assume I can replace "." with "[\s\S]", right? – Iain Curtis-Shanley Jan 08 '22 at 07:56
  • Error: import sys import re from odf.opendocument import load from odf import text, teletype infile = load(r'C:\Users\Iainc\Documents\Blah Blah.odt') for item in infile.getElementsByType(text.P): s = teletype.extractText(item) m = re.sub(r'\[\((?:(?!\[\().)*?\)\]', '', s); if m != s: new_item = text.P() new_item.setAttribute('stylename', item.getAttribute('stylename')) new_item.addText(m) item.parentNode.insertBefore(new_item, item) item.parentNode.removeChild(item) infile.save('C:\Users\Iainc\Documents\Blah Blah.odt 2') – Iain Curtis-Shanley Jan 08 '22 at 08:27
  • The error reads thus: infile.save(r'C:\Users\Iainc\Documents\The Seventh Story 2.odt') File "", line 10 infile.save(r'C:\Users\Iainc\Documents\The Seventh Story 2.odt') ^^^^^^ SyntaxError: invalid syntax What on earth is the problem ...? So far as I can see, the code is good ... (Oh, of course, when I pasted the code above, it seems to have removed several things from the regex, but it is the same as it should be in actuality.) – Iain Curtis-Shanley Jan 08 '22 at 08:30
  • Regarding your first comment, yes, you could use `[\s\S]` (or `[\w\W]`), but you can achieve the same result by simply setting the *single-line* flag. As to how one gains proficiency in the use of regular expressions, it's no different from the ways people gain proficiency in any computer language. In my own case, for the most part I picked it up by just hanging out in this forum. I am no expert, however. There is a mystique about regexes, but that's mainly due to the unfamiliar notation. The basics are actually quite straightforward. – Cary Swoveland Jan 08 '22 at 20:23
  • As I know only a wee bit of Python I can't help you resolve the error you got. I suggest you post the code at a site such as *tio.run* and then post a comment with a link to your code and ask readers to help you correct it. – Cary Swoveland Jan 08 '22 at 21:09