Errors when trying to remove parentheses in python text

Question

I've been working on a bit of code to take a bunch of histograms from other files and plot them together. In order to make sure the legend displays correctly I've been trying to take the titles of these original histograms and cut out a bit of information that isn't needed any more.

The section I don't need takes the form (A mass=200 GeV). I've had no problem removing what's inside the parentheses. Unfortunately, everything I've tried for the parentheses themselves either has no effect, negates the code that removes the text, or throws errors.

I've tried using suggestions from "Remove parenthesis and text in a file using Python" and "How can I remove text within parentheses with a regex?"

The error my current attempt gives me is

'str' object cannot be interpreted as an integer

This is the section of the code:

histo_name = ''

# this is a list of things we do not want to show up in our legend keys
REMOVE_LIST = ["(A mass = 200 GeV)"]

# these two lines use the re module to remove things from a piece of text
# that are specified in the remove list
remove = '|'.join(REMOVE_LIST)
regex = re.compile(r'\b('+remove+r')\b')

# Creating the correct name for the stacked histogram
for histo in histos:

    if histo == histos[0]:

        # place_holder contains the edited string we want to set the
        # histogram title to
        place_holder = regex.sub('', str(histo.GetName()))
        histo_name += str(place_holder)
        histo.SetTitle(histo_name)

    else:

        place_holder = regex.sub(r'\(\w*\)', '', str(histo.GetName()))
        histo_name += ' + ' + str(place_holder)
        histo.SetTitle(histo_name)

The if/else bit is just because the first histogram I pass in isn't getting stacked, so I want it to keep its own name, while the rest are stacked in order, hence the '+' etc. I thought I'd include it for completeness.

Apologies if I've done something really obvious wrong, I'm still learning!

Please check [my today's answer](http://stackoverflow.com/a/31469645/3832970). You need to escape brackets with `re.escape` if you plan to match them literally with a regex. Try with `remove = '|'.join([re.escape(x) for x in REMOVE_LIST])` — Wiktor Stribiżew, Jul 17 '15 at 13:00
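A minimal sketch of the `re.escape` suggestion in the comment above, assuming the entries in REMOVE_LIST are meant to be matched literally. The title string is a hypothetical example, not taken from the original histograms, and the `\b` anchors from the question are left out here.

```python
import re

# Entries to strip from legend titles, written as plain text (from the question).
REMOVE_LIST = ["(A mass = 200 GeV)"]

# re.escape backslash-escapes regex metacharacters such as ( and ),
# so each entry is matched literally instead of being read as a group.
remove = '|'.join(re.escape(x) for x in REMOVE_LIST)
regex = re.compile(remove)

# Hypothetical histogram title, used only for illustration.
title = "ttbar background (A mass = 200 GeV)"

# Remove the entry, then trim the space left behind at the end.
cleaned = regex.sub('', title).strip()
print(cleaned)
```

With the parentheses escaped, the whole entry is removed in one substitution, so no second pass over the title should be needed.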

James Elderfield · Answer 1 · 2015-07-17T14:02:14.780

1

From the Python docs: to match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].

So use one of the above patterns instead of the plain brackets in your regex, e.g. REMOVE_LIST = [r"\(A mass = 200 GeV\)"]

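EDIT: The issue seems to be with your use of \b in the regex. \b is a zero-width word-boundary assertion, so it only succeeds between a word character and a non-word character. Next to a parenthesis that is preceded or followed by a space, it fails and the whole pattern does not match. My seemingly-working example, which drops \b, is: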

import re

# Test input
myTestString = "someMess (A mass = 200 GeV) and other mess (remove me if you can)"
replaceWith = "HEY THERE FRIEND"

# What to remove
removeList = [r"\(A mass = 200 GeV\)", r"\(remove me if you can\)"]

# Build the regex
remove = r'(' + '|'.join(removeList) + r')'
regex = re.compile(remove)

# Try it!
out = regex.sub(replaceWith, myTestString)

# See if it worked
print(out)

edited Jul 17 '15 at 14:02

answered Jul 17 '15 at 13:03

James Elderfield


I tried `REMOVE_LIST = ["\(A mass = 200 GeV\)"]` as well as `REMOVE_LIST = ["[(]A mass = 200 GeV[)]"]`, both with and without the `r'\(\w*\)'` part inside place_holder. Without it, nothing was removed at all; with it, I got `TypeError: 'str' object cannot be interpreted as an integer` – Ciara Jul 17 '15 at 13:27
`\b` cannot consume any characters: it is a word boundary assertion (and works like a look-around that consumes nothing). So `\b` does not match braces; it just asserts the position between a word and a non-word character. – Wiktor Stribiżew Jul 20 '15 at 07:24
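The TypeError reported in the question and in the comment above can be reproduced in isolation. A likely cause, sketched here under the assumption of Python 3, is that the `sub` method of a compiled pattern takes `(repl, string, count=0)`, so passing a pattern as an extra first argument pushes the title into the `count` slot. The title string is a hypothetical example.

```python
import re

regex = re.compile(re.escape("(A mass = 200 GeV)"))
name = "signal (A mass = 200 GeV)"  # hypothetical title, for illustration only

# A compiled pattern already knows what to search for:
# sub(repl, string, count=0) needs only the replacement and the input.
ok = regex.sub('', name)

# With three positional arguments everything shifts by one:
# repl=r'\(\w*\)', string='', count=name. count must be an integer,
# so Python raises a TypeError before any matching happens.
try:
    regex.sub(r'\(\w*\)', '', name)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(repr(ok))
print(error_message)
```

The three-argument form `(pattern, repl, string)` belongs to the module-level `re.sub`, not to the method on a compiled pattern.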

score 0 · Answer 2 · answered Jul 17 '15 at 13:57

There are 2 problems you are facing:

1. You join your strings into a regex pattern without escaping them.
2. You are using word boundaries, but some of your entries start/end with a non-word character (thus r"\)\b" will not match a ) that is followed by a space or the end of the string).

This fixes the first issue, but not the second (it finds More+[fun]+text only):

import re

REMOVE_LIST = ["(A mass = 200 GeV)", "More+[fun]+text"]
# Escape each entry so that it is matched literally
remove = '|'.join([re.escape(x) for x in REMOVE_LIST])
ptrn = r'\b(?:'+remove+r')\b'
print(ptrn)
regex = re.compile(ptrn)
print(regex.findall("Now, (A mass = 200 GeV) and More+[fun]+text inside"))
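Why the parenthesised entry is skipped can be checked directly. The sketch below uses the same test sentence and hand-escaped patterns to compare a search with and without `\b`.

```python
import re

text = "Now, (A mass = 200 GeV) and More+[fun]+text inside"

# \b needs a word character on one side and a non-word character on the other.
# Both the space before "(" and "(" itself are non-word characters,
# so there is no boundary there and the search fails.
with_boundaries = re.search(r'\b\(A mass = 200 GeV\)\b', text)

# Without \b, the escaped parentheses are matched literally.
without_boundaries = re.search(r'\(A mass = 200 GeV\)', text)

# An entry that starts and ends with word characters still works with \b.
word_entry = re.search(r'\bMore\+\[fun\]\+text\b', text)

print(with_boundaries)
print(without_boundaries.group())
print(word_entry.group())
```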

You'd need a smarter way to create your pattern. Like this:

import re
REMOVE_LIST = ["(A mass = 200 GeV)", "More+[fun]+text"]

remove_with_boundaries = '|'.join([re.escape(x) for x in REMOVE_LIST if re.match(r'\w', x) and re.search(r'\w$', x)])
remove_with_no_boundaries = '|'.join([re.escape(x) for x in REMOVE_LIST if not re.match(r'\w', x) and not re.search(r'\w$', x)])
remove_with_right_boundaries = '|'.join([re.escape(x) for x in REMOVE_LIST if not re.match(r'\w', x) and re.search(r'\w$', x)])
remove_with_left_boundaries = '|'.join([re.escape(x) for x in REMOVE_LIST if re.match(r'\w', x) and not re.search(r'\w$', x)])

# Collect the non-empty alternatives, then join them with '|'
# (this avoids a leading '|', which would add an empty alternative)
parts = []
if remove_with_boundaries:
    parts.append(r'\b(?:' + remove_with_boundaries + r')\b')
if remove_with_left_boundaries:
    parts.append(r'\b(?:' + remove_with_left_boundaries + r')')
if remove_with_right_boundaries:
    parts.append(r'(?:' + remove_with_right_boundaries + r')\b')
if remove_with_no_boundaries:
    parts.append(r'(?:' + remove_with_no_boundaries + r')')
ptrn = '|'.join(parts)

print(ptrn)
regex = re.compile(ptrn)
print(regex.findall("Now, (A mass = 200 GeV) and More+[fun]+text inside"))

See IDEONE demo

For the two ["(A mass = 200 GeV)", "More+[fun]+text"] entries as input, the regex \b(?:More\+\[fun\]\+text)\b|(?:\(A\ mass\ \=\ 200\ GeV\)) is generated and the output is ['(A mass = 200 GeV)', 'More+[fun]+text'].
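The same idea can be written more compactly by deciding the boundaries per entry instead of sorting the entries into four groups. This is a sketch rather than the answer's own code: the helper name `build_remove_regex` and the title "Zjets (A mass = 200 GeV)" are hypothetical.

```python
import re

def build_remove_regex(entries):
    # Add a word boundary only on the sides where an entry
    # starts or ends with a word character.
    parts = []
    for entry in entries:
        left = r'\b' if re.match(r'\w', entry) else ''
        right = r'\b' if re.search(r'\w$', entry) else ''
        parts.append(left + re.escape(entry) + right)
    return re.compile('|'.join(parts))

regex = build_remove_regex(["(A mass = 200 GeV)", "More+[fun]+text"])

# Same test sentence as above.
found = regex.findall("Now, (A mass = 200 GeV) and More+[fun]+text inside")
print(found)

# Applied to a hypothetical histogram title.
cleaned = regex.sub('', "Zjets (A mass = 200 GeV)").strip()
print(cleaned)
```

Building each alternative separately keeps the escaping and the boundary decision in one place, at the cost of repeating `\b` for every word-like entry.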
