Replace string between two delimiters in html

Question

How can I replace the string located between the quotes of href=""?

<td><a href="https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n" target="_blank">https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n</a></td>
    </tr>

I want to replace this:

href="https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n"

with this:

href="LINK"

What code have you written in an attempt to do this? Can you share it here as a [mre], per guidance on [ask]? Where are you getting stuck? We aren't going to write your code for you, but we can assist if you can demonstrate a good-faith bare-minimum attempt at resolving this on your own before posting here. Why have you elected to use [tag:re] for this when it's widely accepted that [RegExp is not a good solution for parsing (X)HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) due to the extensive edge cases you'll have to account for. — esqew, Nov 29 '21 at 14:13
You tagged this with `re` but I think you should use a HTML Library like BeautifulSoup, rather than trying to do this from scratch. — JeffUK, Nov 29 '21 at 14:13
https://stackoverflow.com/questions/2782097/is-there-a-built-in-package-to-parse-html-into-dom — mplungjan, Nov 29 '21 at 14:14
I am not scraping anything from web. I am generating an email body and filling it with data from a dataframe. When I generate the HTML code with df.to_html() it is shown like this with the whole link in the mail, not being readable for the user — alb, Nov 29 '21 at 14:16
@alb Who said you were scraping anything from the web...? "*it is shown like this with the whole link in the mail, not being readable for the user*" You haven't stated what you *expect* or desire instead, nor what code you've written (as a [mre]) to either (a) generate this e-mail in the first place, or (b) resolve the issue you describe. Please edit your question to include this and all pertinent debugging/contextual information as prescribed by [ask]. It's very difficult to answer this question as it currently stands. — esqew, Nov 29 '21 at 14:18
And don't you mean you want to change `https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n` to `LINK` otherwise it makes no sense — mplungjan, Nov 29 '21 at 14:21

Frederik Rogalski · Accepted Answer · 2021-11-29T23:32:14.147

For a quick and dirty fix, you could use re.sub() to match the link text between the opening <a ...> tag and </a> and replace it with your own:

import re
html = """<td><a href="https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n" target="_blank">https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n</a></td>
    </tr>"""
# Replace everything between the "> that closes the opening tag and the next </a>
re.sub(r'">.*?</a>', '">LINK</a>', html)

Output:

'<td><a href="https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n" target="_blank">LINK</a></td>\n    </tr>'
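
A pattern of this kind is sensitive to how many links sit on one line. The sketch below uses a made-up two-link fragment (the example.com URLs are placeholders, not from the question) to show that a greedy `.*` runs from the first `">` to the last `</a>` and swallows the second link, while a non-greedy `.*?` anchored on the `<a ...>` tag rewrites each link separately. It still assumes simple, well-formed anchors with no nested tags.

```python
import re

# Hypothetical fragment with two links on one line (placeholder URLs).
html = (
    '<td><a href="https://example.com/a">https://example.com/a</a></td>'
    '<td><a href="https://example.com/b">https://example.com/b</a></td>'
)

# Greedy: .* extends to the LAST </a>, so the second link disappears.
greedy = re.sub(r'">.*</a>', '">LINK</a>', html)

# Non-greedy and anchored on the <a ...> tag: each link text is replaced on its own.
# Group 1 is the opening tag, group 2 the closing tag; both are kept as-is.
targeted = re.sub(r'(<a\b[^>]*>).*?(</a>)', r'\g<1>LINK\g<2>', html)

print(greedy)
print(targeted)
```

The `\g<1>` form is used in the replacement so the group reference cannot be confused with the text that follows it.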

But remember that parsing HTML with regular expressions is not recommended, because HTML has many edge cases that a pattern can miss. I would only use this as a quick and dirty fix when I know exactly how my input HTML is structured. For a more robust approach, you should look into HTML parsers (e.g. BeautifulSoup).
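
As a rough illustration of the parser route, here is a minimal sketch that swaps in the standard library's `html.parser` instead of BeautifulSoup, purely to avoid a third-party dependency. The class `LinkTextReplacer` and the helper `replace_link_text` are hypothetical names made up for this example. It assumes anchors contain only text, and it does not re-emit comments or doctype declarations.

```python
from html import escape
from html.parser import HTMLParser


class LinkTextReplacer(HTMLParser):
    """Re-emit the HTML as parsed, swapping the text inside every <a> element."""

    def __init__(self, new_text):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.new_text = escape(new_text)
        self.parts = []
        self.in_anchor = False

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it appeared in the input.
        self.parts.append(self.get_starttag_text())
        if tag == "a":
            self.in_anchor = True
            self.parts.append(self.new_text)

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # self-closing tags, e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_anchor:  # drop the original link text
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.in_anchor:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_anchor:
            self.parts.append(f"&#{name};")


def replace_link_text(html, new_text="LINK"):
    parser = LinkTextReplacer(new_text)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = (
    '<td><a href="https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n" '
    'target="_blank">https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n</a></td>\n'
    '    </tr>'
)
result = replace_link_text(html)
print(result)
```

The href attribute is left untouched because the opening tag is copied verbatim; only the visible text between `<a ...>` and `</a>` changes.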

edited Nov 29 '21 at 23:32

answered Nov 29 '21 at 14:24


thanks for your answer, but as @mplungjan stated, I'd have to replace the other part of the string to rename the hyperlink, i.e. the part after the opening tag is closed. How would that look? – alb Nov 29 '21 at 15:53
I edited the answer to reflect what you wanted to do. – Frederik Rogalski Nov 29 '21 at 23:32
@alb see the changes. – Frederik Rogalski Nov 30 '21 at 01:24
thanks it worked! – alb Dec 01 '21 at 19:15
