
I currently have a body of text like this:

text = "hello this [is a cool] line of text that might have [two] brackets."

What I need is to parse this text and replace the bracketed parts, so that in this example it would end up like this:

text = "hello this <a href='/phrase/is a cool/'>is a cool</a> line of text that might have <a href='/phrase/two/'>two</a> brackets."

I think the regex to find everything in brackets is \[.*?\], but I'm unsure how to do the replacement itself.

nadermx

2 Answers


You can do this in two steps:

  1. Get all substrings enclosed by [ and ]
  2. Replace the content with appropriate text
>>> import re
>>> txt = "hello this [is a cool] line of text that might have [two] brackets."
>>> phrases = re.findall(r"(\[.+?\])", txt)
>>> for phrase in phrases:
...     txt = txt.replace(phrase, "<a href='/phrase/{}/'>{}</a>".format(phrase[1:-1], phrase[1:-1]))
... 
>>> txt
"hello this <a href='/phrase/is a cool/'>is a cool</a> line of text that might have <a href='/phrase/two/'>two</a> brackets."
>>> 
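The same two steps can also be folded into a single pass. The following is a minimal sketch, assuming the link format from the question; it uses re.sub with a capture group and the \1 backreference, and the variable name linked is only illustrative.

```python
import re

text = "hello this [is a cool] line of text that might have [two] brackets."

# The parentheses capture only the text between the brackets.
# \1 in the replacement string inserts that captured text.
linked = re.sub(r"\[(.+?)\]", r"<a href='/phrase/\1/'>\1</a>", text)

print(linked)
```

Because the substitution happens while matching, there is no separate str.replace call, so text outside the brackets is never touched.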
tbhaxor

You can do it like this:

import re

text = "hello this [is a cool] line of text that might have [two] brackets."

brackets = re.compile(r'\[(.*?)\]')
new_text = brackets.sub(lambda x: f"<a href='/phrase/{x.group(1)}/'>{x.group(1)}</a>", text)

print(new_text)

This replaces each match of the pattern with whatever the lambda returns.
x.group(1) returns the first capture group in the pattern, (.*?). Group 0 is the whole match, so capture groups are numbered from 1. It therefore returns only the text between the brackets, which is then formatted with an f-string.
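
To make the difference between the groups concrete, here is a small sketch using re.search on a shortened version of the sample text (the variable name match is illustrative):

```python
import re

match = re.search(r'\[(.*?)\]', "hello this [is a cool] line")

print(match.group(0))  # the whole match, brackets included
print(match.group(1))  # the first capture group, brackets excluded
```

The callable passed to sub receives the same kind of match object, once per match.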

To also remove punctuation from the text inside the brackets, you could use this code (notice that the result contains none of the . characters that were between the brackets):

import re
import string

text = "hello this [is a..... cool] line of text that might have [two] brackets."


def replace_with_link(match):
    info = match.group(1)
    info = info.translate(str.maketrans('', '', string.punctuation))
    return f'<a href="/phrase/{info}/">{info}</a>'


brackets = re.compile(r'\[(.*?)\]')
new_text = brackets.sub(replace_with_link, text)

print(new_text)
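
The punctuation step can be looked at in isolation. This is a small sketch of just the translate call; the names strip_punctuation and cleaned are illustrative, and it assumes only ASCII punctuation needs removing, since that is all string.punctuation covers.

```python
import string

# Translation table whose third argument lists the characters to delete.
strip_punctuation = str.maketrans('', '', string.punctuation)

cleaned = "is a..... cool".translate(strip_punctuation)
print(cleaned)
```

Letters, digits and spaces are left alone, but apostrophes and hyphens count as punctuation too, so a phrase like "don't" would come out as "dont".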
Matiiss
  • Thank you. If the text inside the brackets has special characters like commas or periods, will this leave them in or strip them? And if it leaves them in, is there any way to strip them? – nadermx Oct 09 '21 at 21:26
  • @nadermx Well, `.` matches any character (except newline) and doesn't exclude anything, so punctuation is left in. I will edit to add how to remove all the punctuation, if that is what you were looking for. – Matiiss Oct 09 '21 at 21:30
  • @nadermx I have added the code for removing punctuation. I am pretty sure I wrote a comment saying so, but I don't see it now, so I am writing this one just in case. – Matiiss Oct 10 '21 at 12:56