Remove all text within (potentially nested) curly braces from string

Question

For an NLP project, I'm looking to preprocess text which sometimes contains unwanted content in curly braces that looks like JSON, for example:

Some useful content here {"contentId":"QI9GPST0AFB401","dimensions":{"large_desktop":[[120,60]]}} good stuff here {some other curly braces}

All I want to do is remove the text within curly braces, to be left with

Some useful content here good stuff here

The complexity seems to come from the fact that there are multiple sets of curly braces, which disqualifies solutions like this one, and that there are nested curly braces, which disqualifies regex-based solutions like this one.

Find the leftmost `{` index (`i1`), find the rightmost `}` index (`i2`). Delete from `i1` to `i2`. — felipe, Oct 17 '20 at 20:35
This doesn't work because of the multiple sets of curly braces I mentioned. In my example, it would remove 'good stuff here' — Chris Schmitz, Oct 17 '20 at 20:37
Please edit the code/regex of your attempt to solve this problem into your question. — DisappointedByUnaccountableMod, Oct 17 '20 at 20:48

score 1 · Answer 1 · 2020-10-18T10:10:20.327

1

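In most cases a regex will do the job. The pattern below handles nesting only when the closing braces are adjacent (as in `}}`); it does not handle text that follows an inner closing brace.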

import re

s = 'Some useful content here {"contentId":"QI9GPST0AFB401","dimensions":{"large_desktop":[[120,60]]}} good stuff here {some other curly braces}'

ss = re.sub(r'{[^}]*}*', '', s)

print(ss)

Output

Some useful content here  good stuff here
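To make that limit concrete, here is a minimal sketch. The helper name `strip_flat` and the two sample strings are made up for illustration; only the pattern comes from the answer.

```python
import re

def strip_flat(s):
    # Same pattern as in the answer: '{', any run of non-'}' characters,
    # then any run of '}' characters.
    return re.sub(r'{[^}]*}*', '', s)

# Closing braces are adjacent, so the whole nested block is removed.
print(repr(strip_flat('a {x{y}} b')))

# Text follows the inner closing brace, so 'z}' is left behind.
print(repr(strip_flat('a {x{y}z} b')))
```

The second call shows why a single pass of this pattern cannot be relied on for arbitrary nesting.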

edited Oct 18 '20 at 10:10

answered Oct 17 '20 at 21:11

Boern · Accepted Answer · 2020-10-19T10:07:02.513

0

That sounds like a good job interview question :)

I'd just parse the string manually:

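def cleanString(dirty):
    clean = ""
    bad = 0  # current brace nesting depth; 0 means outside all braces

    for c in dirty:
        if c == '{':
            bad += 1
        elif c == '}':
            bad -= 1
        elif bad == 0:
            # keep only characters that are outside every brace pair
            clean += c

    return clean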
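The depth-counting loop leaves the whitespace that surrounded each removed block, and an unmatched `}` drives the counter negative, which drops the text that follows it. A possible variant, offered only as a sketch: the name `clean_string`, the choice to ignore stray closing braces, and the whitespace collapsing are my assumptions, not part of the answer.

```python
def clean_string(dirty):
    kept = []
    depth = 0  # how many '{' are currently open
    for c in dirty:
        if c == '{':
            depth += 1
        elif c == '}':
            # Assumption: a stray '}' is ignored rather than going negative.
            depth = max(depth - 1, 0)
        elif depth == 0:
            kept.append(c)
    # Collapse the runs of whitespace left behind by removed blocks.
    return ' '.join(''.join(kept).split())

s = ('Some useful content here {"contentId":"QI9GPST0AFB401",'
     '"dimensions":{"large_desktop":[[120,60]]}} good stuff here '
     '{some other curly braces}')
print(clean_string(s))
```

Collecting characters in a list and joining once avoids repeated string concatenation on long inputs.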

edited Oct 19 '20 at 10:07

answered Oct 17 '20 at 20:41

Boern


1

This works so well and looks so simple it's shaming me into revising Algos & Data Structures. Thanks! – Chris Schmitz Oct 17 '20 at 20:48

DisappointedByUnaccountableMod · Answer 3 · 2020-10-17T21:05:22.980

I can’t try this at the moment, but this should work as long as the braces are always exactly balanced:

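import re

def removeBracedContent(s):
    # Repeatedly strip the innermost {...} pairs until no '{' remains.
    while '{' in s:
        stripped = re.sub(r'{[^{}]*}', '', s)
        if stripped == s:
            # Nothing could be removed (e.g. "{" alone or "} {"), so stop
            # instead of looping forever.
            raise Exception("The braces weren't balanced - too many {")
        s = stripped
    if '}' in s:
        raise Exception("The braces weren't balanced - too many }")
    return s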

This repeatedly removes the innermost matching braces, including any text between them, until there aren’t any left.

If the content in braces is spread across more than one line, no extra parameter is needed: the negated character class [^{}] already matches newlines, and flags=re.DOTALL only changes what . matches.
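The innermost-first removal can be traced pass by pass. The sketch below is illustrative only: `INNERMOST`, `strip_passes` and the sample string are hypothetical names and data, and the sample includes a multi-line brace block.

```python
import re

# A brace pair that contains no other braces, i.e. an innermost pair.
INNERMOST = re.compile(r'{[^{}]*}')

def strip_passes(s):
    """Return the string as it looks after each removal pass."""
    passes = []
    while True:
        reduced = INNERMOST.sub('', s)
        if reduced == s:
            # Nothing left to remove.
            return passes
        passes.append(reduced)
        s = reduced

for step in strip_passes('a {x{y}z{w}} b {\nc\n} d'):
    print(repr(step))
```

Each pass peels off one nesting level, so the number of passes equals the deepest nesting in the input.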
