-5

I have an XML file with about 2000 parenthesized snippets like (texthere). I need to remove the parentheses and the text within them. I tried the script below but am getting an error :(

import re, sys

fileName = (sys.argv[2])

with open(fileName) as f:
    input = f.read()
    output = re.sub(r'\(\w*\)', '', input)
    print fileName + " cleaned of all parenthesis"

And my error:

Traceback (most recent call last):
  File "/Users/eeamesX/work/data-scripts/removeParenFromXml.py", line 4, in <module>
    fileName = (sys.argv[2])
IndexError: list index out of range

When I change it to sys.argv[1], I get no errors, but the parentheses in my file.xml still do not get removed.

Anekdotin

2 Answers

1

Since you're calling the script as follows:

python removeparenthesis.py filename.xml

the XML file name will appear under sys.argv[1].
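As a rough illustration of the indexing, here is a minimal sketch (Python 3) of reading the first argument with a guard against the IndexError from the question. The helper name `get_filename` and the usage message are made up for this example.

```python
def get_filename(argv):
    """Return the first command-line argument, or exit with a usage hint."""
    # argv[0] is the script path, so the first real argument is argv[1].
    if len(argv) < 2:
        raise SystemExit("usage: python removeparenthesis.py filename.xml")
    return argv[1]

# In a real script you would pass sys.argv; a literal list is used here
# so the example runs on its own.
print(get_filename(["removeparenthesis.py", "filename.xml"]))
```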

Also, you'd need to use lazy matching in your pattern:

r'\(\w*?\)'    # notice the ?

A better pattern would be:

r'\([^)]*\)'
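To see the difference, here is a small sketch (Python 3) comparing the pattern from the question with the one above. `\w*` only matches letters, digits and underscores, so a group containing a space is left alone. Note also that the script in the question computes `output` but never writes it anywhere, which would explain why file.xml stays unchanged. The helper names `strip_parens` and `clean_file` and the UTF-8 encoding are assumptions for this example.

```python
import re

PAREN = re.compile(r'\([^)]*\)')

def strip_parens(text):
    """Remove every (...) group that contains no closing paren inside it."""
    return PAREN.sub('', text)

def clean_file(path):
    """Read the file, strip the groups, and write the result back in place."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as f:
        cleaned = strip_parens(f.read())
    with open(path, "w", encoding="utf-8") as f:
        f.write(cleaned)

sample = "<a>keep (word) and (two words)</a>"
print(re.sub(r'\(\w*\)', '', sample))  # only "(word)" is removed
print(strip_parens(sample))            # both groups are removed
```

Overwriting the input in place is only one option; writing to a separate output file is safer if you want to keep the original.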
hjpotter92
0

Do you have nested parens?

stuff (words (inside (other) words) eww)

Will you have multiple groups of parens?

stuff (first group) stuff (second group)

Does text within parens have spaces?

stuff (single_word)
stuff (multiple words)

A simple regex could be \(.*?\) although you'll see that the nested parens are not caught (which is fine if you do NOT expect nested parens):

https://regex101.com/r/kB2lU1/1
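For readers who cannot open the link, here is a short sketch (Python 3) that runs the lazy pattern over the sample lines above. It is only an illustration of the behaviour, not a copy of what the linked page shows.

```python
import re

lazy = re.compile(r'\(.*?\)')

cases = [
    "stuff (first group) stuff (second group)",
    "stuff (multiple words)",
    "stuff (words (inside (other) words) eww)",
]

for s in cases:
    # The lazy match stops at the first ")", so the nested case
    # leaves the tail " words) eww)" behind.
    print(repr(lazy.sub('', s)))
```

Keep in mind that `.` does not match a newline by default, so a group spanning several lines would need the `re.DOTALL` flag.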

Edit:

https://regex101.com/r/kB2lU1/2 may be able to handle some of those nested parens, but may still break depending on different types of edge cases.
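One possible way to cope with nesting, which is not necessarily what the linked regex does, is to remove the innermost groups repeatedly until nothing changes. This is a sketch in Python 3 that assumes balanced parens; the name `strip_nested_parens` is made up for this example.

```python
import re

# Matches a group with no parens inside it, i.e. an innermost group.
INNERMOST = re.compile(r'\([^()]*\)')

def strip_nested_parens(text):
    """Peel off innermost (...) groups until the text stops changing."""
    previous = None
    while previous != text:
        previous = text
        text = INNERMOST.sub('', text)
    return text

print(repr(strip_nested_parens("stuff (words (inside (other) words) eww)")))
```

Unbalanced parens are left in place by this approach, so it still depends on which edge cases the file contains.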

You'll need to specify what kinds of edge cases you expect so the answer can be better tailored to your needs.

OnlineCop