I am taking some markdown, turning it into HTML, then stripping the tags to leave me with a clean set of alphanumeric characters only.
The problem is that the markdown has some custom components in it that I am having trouble parsing out.
Here is an example:
{{< custom type="phase1" >}}
Some Text in here (I want to keep this)
{{< /custom >}}
I want to be able to delete everything between the {{ and }} brackets (including the brackets themselves), while keeping the text between the opening and closing instances. Essentially, I just want to be able to remove all instances of {{ *? }} in the file. There can be any number of them in a given file.
Here is what I have tried:
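import re
import markdown
from bs4 import BeautifulSoup

def clean_markdown(self, text_string):
    html = markdown.markdown(text_string)
    soup = BeautifulSoup(html, features="html.parser")
    # to_extract = soup.findAll('script')  # Tried to extract via soup, but no joy as these are not tags
    # Drop every character that is not a word character, whitespace, hyphen, or period
    cleaned = re.sub(r'([^-.\s\w])+', '', soup.text)
    return cleaned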
This works well for everything in the markdown, except that it leaves the text between the {{ and }} in the output. So, in this case, the word "custom" will be in my cleaned text, but I don't want it to be.
As you can see, I tried to extract using Beautiful Soup, but it didn't work as the start value ({{) is different to the end value (}}).
Does anyone have any ideas how to efficiently implement a parser in Python that would clean this?
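One direction that might work, as a minimal sketch rather than a definitive answer: strip the {{ ... }} tags from the raw markdown string with a non-greedy regular expression before it is passed to markdown.markdown. The names SHORTCODE_RE and strip_shortcodes are hypothetical, and the sketch assumes that a tag never contains a literal }} inside it and that nothing else in the file uses double braces.

```python
import re

# Hypothetical pattern: "{{", then as little as possible, then "}}".
# re.DOTALL lets a single tag span several lines (an assumption about the input).
SHORTCODE_RE = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def strip_shortcodes(text_string):
    """Remove every {{ ... }} tag while keeping the text between the tags."""
    return SHORTCODE_RE.sub("", text_string)


sample = (
    '{{< custom type="phase1" >}}\n'
    "Some Text in here (I want to keep this)\n"
    "{{< /custom >}}"
)

without_tags = strip_shortcodes(sample)
# The same character filter used in clean_markdown above.
cleaned = re.sub(r"([^-.\s\w])+", "", without_tags)
print(cleaned.strip())
```

The non-greedy .*? is what keeps the text between the opening and closing tags: each match stops at the first }} it finds instead of running on to the last one in the file.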