1

I am taking some markdown, turning it into html, then parsing out text without tags to leave me with a clean set of alphanumeric characters only.

The problem is that the markdown has some custom components in it that I am having trouble parsing out.

Here is an example:

{{< custom type="phase1" >}}
    Some Text in here (I want to keep this)
{{< /custom >}}

I want to be able to delete everything in between the {{ & }} brackets (including the brackets), while keeping the text in between the first and second instance. Essentially, I just want to be able to remove all instances of {{ *? }} in the file. There can be any number in a given file.

Here is what I have tried:

def clean_markdown(self, text_string):
  html = markdown.markdown(text_string)
  soup = BeautifulSoup(html, features="html.parser")
  # to_extract = soup.findAll('script') //Tried to extract via soup but no joy as not tags
  cleaned = re.sub(r'([^-.\s\w])+', '', soup.text)
  return cleaned

This works well for everything in the markdown except it leaves the value in the text that is between the {{ & }}. So, in this case the word "custom" will be in my cleaned text, but I don't want it to be.

As you can see, I tried to extract using beautiful soup but it didn't work as the start value ({{) is different to the end value (}})

Does anyone have any ideas how to efficiently implement a parser in Python that would clean this?

Jonathan Hall
Donal Rafferty
  • Are the snippets you want to clean up always in the same triplet format you have in the question? – Jack Fleeting Mar 31 '20 at 12:50
  • Get a library that works with markdown specifically, or create your own that handles every custom component. For your attempt to use BeautifulSoup and Regex, see https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags . – NadavS Mar 31 '20 at 12:51
  • @JackFleeting, no they aren't. There can be instances with brackets within brackets. EG: {{ custom }} Hello, how are you {{< more text>}} today? {{/custom}} – Donal Rafferty Mar 31 '20 at 12:57

3 Answers

1

If I understand what you are trying to do correctly, you should be able to use re.sub to replace all the {{...}} patterns with an empty string directly in the text_string parameter:

import re
def clean_markdown(self, text_string):
    # Non-greedy .*? so that each {{...}} tag on a line is removed separately
    return re.sub(r"\{\{.*?\}\}", "", text_string)
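Whether the quantifier is greedy matters once several tags sit on one line, as in the asker's comment under the question. Here is a minimal sketch comparing the two forms; the `sample` string and the variable names are illustrative assumptions, not part of the question's code:

```python
import re

# Illustrative input modelled on the asker's comment: three tags on one line.
sample = "{{ custom }} Hello, how are you {{< more text>}} today? {{/custom}}"

# Greedy: .* runs from the first {{ to the last }} on the line,
# so the text between the tags is removed as well.
greedy = re.sub(r"\{\{.*\}\}", "", sample)

# Non-greedy: .*? stops at the nearest }}, so each tag is removed on its own.
lazy = re.sub(r"\{\{.*?\}\}", "", sample)

print(repr(greedy))
print(repr(lazy))
```

Neither pattern crosses a newline unless re.DOTALL is passed, so a tag that spans several lines would be left in place by both.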
Alain T.
1

Using a regex match should work well:

import re
import markdown
from bs4 import BeautifulSoup

def clean_markdown(self, text_string):
    html = markdown.markdown(text_string)
    soup = BeautifulSoup(html, features="html.parser")
    # re.search finds the block anywhere in the text; re.match only tries position 0
    match = re.search(r"\{\{.+\}\}\n(?P<text>.*)\n\{\{.+\}\}", soup.text)
    # Fall back to the unmodified text when no tag/text/tag block is present
    return match.group("text") if match else soup.text
maor10
1

If I understand correctly, try this:

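import re
string = '{{< custom type="phase1" >}}\n    Some Text in here (I want to keep this)\n{{< /custom >}}'
result = re.sub(r"\{\{.*?\}\}", "", string).strip()
print(result)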

Output:

Some Text in here (I want to keep this)
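
To get from there to the alphanumeric-only text the question asks for, the tag removal could be combined with the character filter from the question. This is only a sketch under some assumptions: `strip_tags_and_symbols` is a hypothetical helper name, it works on plain strings and leaves out the markdown-to-HTML step, and the whitespace collapsing at the end is an addition that the question does not ask for.

```python
import re

# Matches one {{ ... }} tag; non-greedy so adjacent tags are removed separately.
TAG_RE = re.compile(r"\{\{.*?\}\}")

# The character filter from the question: drop anything that is not
# a hyphen, a period, whitespace, or a word character.
NON_TEXT_RE = re.compile(r"[^-.\s\w]+")


def strip_tags_and_symbols(text):
    """Hypothetical helper: remove {{ ... }} tags, then apply the question's filter."""
    without_tags = TAG_RE.sub("", text)
    cleaned = NON_TEXT_RE.sub("", without_tags)
    # Collapse the whitespace runs left behind by the removed tags (an assumption).
    return " ".join(cleaned.split())


example = '{{< custom type="phase1" >}}\n    Some Text in here (I want to keep this)\n{{< /custom >}}'
print(strip_tags_and_symbols(example))
```

Removing the tags first means words such as "custom" never reach the character filter, which is what left them in the cleaned text in the question's version.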
Shubham Sharma