Modify all local links in html file

Question

I want to change the links in an HTML page like the one below:

//html
<html>
    <head>
        <title>Hello</title>
    </head>
    <body>
        <p>this is a simple text in html file</p>
        <a href="https://google.com">Google</a>
        <a href="/frontend/login/">Login</a>
        <a href="/something/work/">Something</a>
    </body>
 </html>



//Result
    <html>
        <head>
            <title>Hello</title>
        </head>
        <body>
            <p>this is a simple text in html file</p>
            <a href="https://google.com">Google</a>
            <a href="/more/frontend/login/part/">Login</a>
            <a href="/more/something/work/extra/">Something</a>
        </body>
     </html>

How can I transform the HTML into the result above and save it as an HTML file using Python?

What have you tried so far? You could use [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) or another scraping library to parse the HTML easily. — IMCoins, Apr 26 '19 at 10:32
Possible duplicate of [BeautifulSoup - modifying all links in a piece of HTML?](https://stackoverflow.com/questions/459981/beautifulsoup-modifying-all-links-in-a-piece-of-html) — QHarr, Apr 26 '19 at 10:52
That example replaces the link rather than modifying it. I want to add text to the existing link, not substitute a new link. — Himel Rana, Apr 26 '19 at 11:16

Nicolas · Answer 1 · 2019-04-26T15:07:47.363

0

Well, doing this via Regex is really simple.

Use href="\/([^"]*) as pattern and href="\/more\/\1additional as replacement.

Have a look here:

https://regex101.com/r/7ACBFY/2

Previous "50% attempt" (sorry that I missed your second part):

https://regex101.com/r/7ACBFY/1
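
In Python, the pattern and replacement above could be applied with re.sub roughly as follows. This is a minimal sketch rather than the answerer's own code: the sample html string and the "extra/" suffix are placeholders, and the forward slashes are written unescaped because Python's re module does not need them escaped.

```python
import re

# Sample input; only the second link is local (starts with "/")
html = (
    '<a href="https://google.com">Google</a>\n'
    '<a href="/frontend/login/">Login</a>'
)

# Capture everything after the leading slash, up to the closing quote
pattern = r'href="/([^"]*)'

# \g<1> re-inserts the captured path; "extra/" is a placeholder suffix
replacement = r'href="/more/\g<1>extra/'

result = re.sub(pattern, replacement, html)
print(result)
# The Login link becomes href="/more/frontend/login/extra/";
# the Google link does not match the pattern and is left unchanged.
```

The \g<1> form is used instead of \1 so that the group reference cannot be confused with any digits that might follow it in the replacement text.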

edited Apr 26 '19 at 15:07

answered Apr 26 '19 at 10:40

Nicolas


Thanks, you solved 50% of it. I want to add text at the start of the link and also at the end (such as "extra"), like this: /more/something-previously-exists/extra/ – Himel Rana Apr 26 '19 at 11:02
Your regex href="\/([^"]*) may match an invalid link – Himel Rana Apr 26 '19 at 15:34
What do you mean? – Nicolas Apr 26 '19 at 19:10

score 0 · Answer 2 · answered Apr 26 '19 at 10:44


If you store the html file as a string (e.g. html), then you can do a simple replace:

result = html.replace('<a href="/', '<a href="/more/')

Ollie


The link is not empty. It may already contain some data. I want to add to the existing value. – Himel Rana Apr 26 '19 at 11:00

score 0 · Accepted Answer · answered Apr 26 '19 at 13:59

I have solved it on my own, but I think this can help a lot of people. That's why I am answering my own question and leaving it publicly available.

Thank you, Nicolas. His 30-50% solution helped me a lot in reaching the complete solution.

import re

regex = r"href=\"\/"

test_str = ("<html>\n"
    "    <head>\n"
    "        <title>Hello</title>\n"
    "    </head>\n"
    "    <body>\n"
    "        <p>this is a simple text in html file</p>\n"
    "        <a href=\"https://google.com\">Google</a>\n"
    "        <a href=\"/front-end/login/\">Login</a>\n"
    "        <a href=\"/something/work/\">Something</a>\n"
    "    </body>\n"
    " </html>")

subst = "href=\"/more/"

# count=0 replaces every match; pass a positive count to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

# Second pass: capture the run of URL characters in each local href and append "hello/"
subst2 = "\\1hello/"
regex2 = r"(href=\"/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)"
result2 = re.sub(regex2, subst2, result, count=0, flags=re.MULTILINE)

if result2:
    print(result2)

# Write the modified HTML to a file; the with block closes it automatically
with open("solution.html", "w") as output_file:
    output_file.write(result2)


Good work. Sorry, I didn't notice your second additional term. If you want to simplify your solution, check out this (I will edit my answer as well): https://regex101.com/r/7ACBFY/2 — Nicolas, Apr 26 '19 at 15:05
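
The single-pass simplification suggested in the comment above, combined with reading and saving the file, might look something like the sketch below. The function names, the default "/more/" prefix and "extra/" suffix, and the file names are all hypothetical. The (?!/) lookahead, which skips protocol-relative links such as href="//cdn.example.com", is an added assumption and not part of the original regex.

```python
import re
from pathlib import Path

# Match href="/..." but skip protocol-relative URLs such as href="//cdn.example.com"
LOCAL_HREF = re.compile(r'href="/(?!/)([^"]*)"')


def rewrite_local_links(html, prefix="/more/", suffix="extra/"):
    """Return html with prefix and suffix added around every local href path."""
    def add_parts(match):
        # group(1) is the existing path without its leading slash
        return f'href="{prefix}{match.group(1)}{suffix}"'
    return LOCAL_HREF.sub(add_parts, html)


def rewrite_file(source, target):
    """Read source, rewrite its local links, and save the result to target."""
    html = Path(source).read_text(encoding="utf-8")
    Path(target).write_text(rewrite_local_links(html), encoding="utf-8")
```

Calling rewrite_file("page.html", "solution.html") would then read the hypothetical page.html and write the rewritten markup to solution.html. Like the other regex answers here, this assumes double-quoted href attributes; a parser such as BeautifulSoup would be more robust for arbitrary HTML.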
