How to remove text between two double brackets in Python

Question

I am taking some markdown, turning it into html, then parsing out text without tags to leave me with a clean set of alphanumeric characters only.

The problem is that the markdown has some custom components in it that I am having trouble parsing out.

Here is an example:

{{< custom type="phase1" >}}
    Some Text in here (I want to keep this)
{{< /custom >}}

I want to delete everything between the {{ and }} brackets (including the brackets themselves), while keeping the text that sits between the opening and closing components. Essentially, I just want to remove all instances of {{ *? }} in the file. There can be any number of them in a given file.

Here is what I have tried:

def clean_markdown(self, text_string):
  html = markdown.markdown(text_string)
  soup = BeautifulSoup(html, features="html.parser")
  # to_extract = soup.findAll('script') //Tried to extract via soup but no joy as not tags
  cleaned = re.sub(r'([^-.\s\w])+', '', soup.text)
  return cleaned

This works well for everything in the markdown except it leaves the value in the text that is between the {{ & }}. So, in this case the word "custom" will be in my cleaned text, but I don't want it to be.

As you can see, I tried to extract using beautiful soup but it didn't work as the start value ({{) is different to the end value (}})

Does anyone have any ideas how to efficiently implement a parser in Python that would clean this?

Are the snippets you want to clean up always in the same triplet format you have in the question? — Jack Fleeting, Mar 31 '20 at 12:50
Get a library that works with markdown specifically, or create your own that handles every custom component. For your attempt to use BeautifulSoup and Regex, see https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags . — NadavS, Mar 31 '20 at 12:51
@JackFleeting, no they aren't. There can be instances with brackets within brackets. EG: {{ custom }} Hello, how are you {{< more text>}} today? {{/custom}} — Donal Rafferty, Mar 31 '20 at 12:57

score 1 · Accepted Answer · answered Mar 31 '20 at 12:52

If I understand correctly what you are trying to do, you should be able to use re.sub to replace all the {{...}} patterns with an empty string directly in the text_string parameter:

def clean_markdown(self, text_string):
    # Replace each {{...}} span with an empty string.
    return re.sub(r"\{\{.*\}\}", "", text_string)
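One caveat is worth illustrating: `.*` is greedy, so when a line holds more than one `{{...}}` span the match runs from the first `{{` to the last `}}`. The sketch below compares the greedy pattern with a lazy `.*?` variant on the one-line example the asker gave in the comments. The variable names are only for illustration.

```python
import re

line = "{{ custom }} Hello, how are you {{< more text>}} today? {{/custom}}"

# Greedy: .* runs from the first "{{" to the last "}}" on the line,
# so the text between the tags is removed as well.
greedy = re.sub(r"\{\{.*\}\}", "", line)

# Lazy: .*? stops at the nearest "}}", so each tag is removed on its own.
lazy = re.sub(r"\{\{.*?\}\}", "", line)

print(repr(greedy))  # ''
print(repr(lazy))    # ' Hello, how are you  today? '
```

With a single `{{...}}` per line the two patterns behave the same; they only differ when several spans share a line.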

Alain T.

@JonClements you are right (I assumed only one per line which may not be the case indeed) – Alain T. Mar 31 '20 at 12:56

maor10 · Answer 2 · 2020-03-31T12:58:45.807

1

Using a regex match should work well:

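def clean_markdown(self, text_string):
    html = markdown.markdown(text_string)
    soup = BeautifulSoup(html, features="html.parser")
    # to_extract = soup.findAll('script') //Tried to extract via soup but no joy as not tags
    # re.search (unlike re.match) finds the block anywhere in the text;
    # re.DOTALL lets the captured text span several lines.
    match = re.search(r"\{\{.+?\}\}\n(?P<text>.*?)\n\{\{.+?\}\}", soup.text, re.DOTALL)
    # Fall back to the full text when no {{...}} pair is found.
    cleaned = match.group("text") if match else soup.text
    return cleaned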

edited Mar 31 '20 at 12:58

answered Mar 31 '20 at 12:53

score 1 · Answer 3 · answered Mar 31 '20 at 12:56

If I understand correctly, try this:

result = re.sub(r"\{\{.*?\}\}", "", string).strip()
print(result)

Output:

Some Text in here (I want to keep this)
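For completeness, here is a self-contained sketch of the same idea that can be run as-is. It assumes the input is the raw example from the question, adds `re.DOTALL` in case a `{{...}}` span ever wraps across lines, and collapses leftover whitespace. The helper name `strip_shortcodes` and the constant `SHORTCODE` are hypothetical, not part of any library.

```python
import re

# Hypothetical names; the pattern is the lazy one shown above.
SHORTCODE = re.compile(r"\{\{.*?\}\}", re.DOTALL)

def strip_shortcodes(text):
    """Remove every {{ ... }} span and tidy the whitespace left behind."""
    without_tags = SHORTCODE.sub("", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", without_tags).strip()

sample = '''{{< custom type="phase1" >}}
    Some Text in here (I want to keep this)
{{< /custom >}}'''

inline = "{{ custom }} Hello, how are you {{< more text>}} today? {{/custom}}"

print(strip_shortcodes(sample))  # Some Text in here (I want to keep this)
print(strip_shortcodes(inline))  # Hello, how are you today?
```

One option is to call such a helper on the raw markdown string before converting it to HTML, so the components never reach the later cleaning step.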

Shubham Sharma
