0

For an NLP project, I'm looking to preprocess text which sometimes contains unwanted content in curly braces that looks like JSON, for example:

Some useful content here {"contentId":"QI9GPST0AFB401","dimensions":{"large_desktop":[[120,60]]}} good stuff here {some other curly braces}

All I want to do is remove the text within curly braces, to be left with

Some useful content here good stuff here

The complexity seems to come from two things: there are multiple sets of curly braces, which disqualifies solutions like this one, and the curly braces can be nested, which disqualifies regex-based solutions like this one.

Wiktor Stribiżew
Chris Schmitz

3 Answers

1

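In most cases a regex will do the job. The pattern below handles the example, including nesting where the closing braces are adjacent (}}), but it stops at the first } it sees, so text that follows an inner closing brace, as in {a {b} c}, would be left behind.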

import re

s = 'Some useful content here {"contentId":"QI9GPST0AFB401","dimensions":{"large_desktop":[[120,60]]}} good stuff here {some other curly braces}'

ss = re.sub(r'{[^}]*}*', '', s)

print(ss)

Output

Some useful content here  good stuff here 
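
To see where this pattern holds up and where it does not, here is a small check on two made-up inputs. The helper name strip_simple and both test strings are assumptions for illustration; only the pattern itself comes from the answer.

```python
import re

# The pattern from the answer above, unchanged.
PATTERN = re.compile(r'{[^}]*}*')

def strip_simple(text):
    """Apply the single-pass regex removal."""
    return PATTERN.sub('', text)

adjacent = 'keep {"a":{"b":1}} keep'   # closing braces are adjacent
trailing = 'keep {a {b} c} keep'       # text follows the inner closing brace

print(repr(strip_simple(adjacent)))
print(repr(strip_simple(trailing)))
```

The first input is cleaned completely, while the second still contains " c}" afterwards, so this approach is probably only safe when nested blocks always end in a run of closing braces.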
0

That sounds like a good job interview question :)

I'd just parse the string manually:

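def cleanString(dirty):
    clean = []  # characters that sit outside every pair of braces
    bad = 0     # current brace nesting depth

    for c in dirty:
        if c == '{':
            bad += 1
        elif c == '}':
            bad -= 1
        elif bad == 0:
            clean.append(c)

    # Joining once avoids repeated string concatenation in the loop.
    return "".join(clean)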
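
A possible variation on the depth-counting idea, sketched under two assumptions that are not in the answer: a stray closing brace should be ignored rather than pushing the depth below zero, and the leftover runs of spaces should be collapsed so the result matches the output the question asks for. The name strip_braces is hypothetical.

```python
def strip_braces(text):
    """Drop everything inside {...}, tracking nesting depth."""
    kept = []
    depth = 0
    for ch in text:
        if ch == '{':
            depth += 1
        elif ch == '}':
            # Assumption: ignore a stray '}' instead of going negative.
            depth = max(depth - 1, 0)
        elif depth == 0:
            kept.append(ch)
    # Collapse the whitespace runs left behind by removed blocks.
    return ' '.join(''.join(kept).split())


sample = ('Some useful content here {"contentId":"QI9GPST0AFB401",'
          '"dimensions":{"large_desktop":[[120,60]]}} good stuff here '
          '{some other curly braces}')
print(strip_braces(sample))
```

Note that collapsing whitespace also flattens newlines and tabs, which may or may not be wanted depending on the downstream NLP step.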
Boern
0

I can’t try this at the moment, but this should work as long as the braces are always exactly balanced:

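import re

def removeBracedContent(s):
    # Repeatedly strip the innermost {...} pairs until no '{' remains.
    while '{' in s:
        s, count = re.subn(r'\{[^{}]*\}', '', s)
        if count == 0:
            # A '{' is left with no matching '}' after it; stop instead of looping forever.
            raise ValueError("The braces weren't balanced - too many {")
    if '}' in s:
        raise ValueError("The braces weren't balanced - too many }")
    return s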

This repeatedly removes the innermost matching braces, including any text between them, until there aren't any left.
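
To make the innermost-first behaviour visible, the following sketch records the string after each pass. The helper strip_passes and the sample input are made up for illustration, and the braces are escaped in the pattern purely for readability.

```python
import re

# A brace pair with no other braces inside it, i.e. an innermost pair.
INNERMOST = re.compile(r'\{[^{}]*\}')

def strip_passes(text):
    """Return the string as it looks after each removal pass."""
    passes = []
    while True:
        text, count = INNERMOST.subn('', text)
        if count == 0:
            break  # nothing left to remove
        passes.append(text)
    return passes

for step in strip_passes('a {b {c {d}}} e {f}'):
    print(repr(step))
```

Each pass peels off one level of nesting, so the number of passes equals the deepest nesting level in the input (three here).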

If the content in braces is spread across more than one line, this still works as written: the negated character class [^{}] matches newlines too, so no extra flag is needed (flags=re.DOTALL only changes what . matches).