1

I want to change links from a html page like below:

//html
<html>
    <head>
        <title>Hello</title>
    </head>
    <body>
        <p>this is a simple text in html file</p>
        <a href="https://google.com">Google</a>
        <a href="/frontend/login/">Login</a>
        <a href="/something/work/">Something</a>
    </body>
 </html>



//Result
    <html>
        <head>
            <title>Hello</title>
        </head>
        <body>
            <p>this is a simple text in html file</p>
            <a href="https://google.com">Google</a>
            <a href="/more/frontend/login/part/">Login</a>
            <a href="/more/something/work/extra/">Something</a>
        </body>
     </html>

So how can I change html to result and save it as html using python ?

Himel Rana
  • 666
  • 5
  • 14
  • What did you try so far ? You could use [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse the html easily or another scraping library. – IMCoins Apr 26 '19 at 10:32
  • Possible duplicate of [BeautifulSoup - modifying all links in a piece of HTML?](https://stackoverflow.com/questions/459981/beautifulsoup-modifying-all-links-in-a-piece-of-html) – QHarr Apr 26 '19 at 10:52
  • This example replace link not modify link. I want to add more with previous link not a new link. – Himel Rana Apr 26 '19 at 11:16
  • Think that example used replace – QHarr Apr 26 '19 at 13:48

3 Answers3

0

Well, doing this via Regex is really simple.

Use href="\/([^"]*) as pattern and href="\/more\/\1additional as replacement.

Have a look here:

https://regex101.com/r/7ACBFY/2


Previous "50% attempt" (sry that I missed you second part):

https://regex101.com/r/7ACBFY/1

Nicolas
  • 754
  • 8
  • 22
0

If you store the html file as a string (e.g. html), then you can do a simple replace:

result = html.replace('<a href="/', '<a href="/more/')
Ollie
  • 1,641
  • 1
  • 13
  • 31
0

I have solved it by own. But I think this can help a lot of people. That's why I am answering my question and leave it at publicly available

Thank you Nicolas. His 30-50% solution helped me a lot for complete solution.

import re

regex = r"href=\"\/"

test_str = ("<html>\n"
    "    <head>\n"
    "        <title>Hello</title>\n"
    "    </head>\n"
    "    <body>\n"
    "        <p>this is a simple text in html file</p>\n"
    "        <a href=\"https://google.com\">Google</a>\n"
    "        <a href=\"/front-end/login/\">Login</a>\n"
    "        <a href=\"/something/work/\">Something</a>\n"
    "    </body>\n"
    " </html>")

subst = "href=\"/more/"

# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)

subst2 = "\\1hello/"
regex2 = r"(href=\"/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)"
result2 = re.sub(regex2, subst2, result, 0, re.MULTILINE)

if result2:
    print (result2)

writtingtofile = open("solution.html","w")
writtingtofile.write(result2)
writtingtofile.close()

Output:

enter image description here

Himel Rana
  • 666
  • 5
  • 14
  • Good work. Sorry, I didn't noticed your second additional term. If you want to simplify your solution, check out this (I will edit my answer as well): https://regex101.com/r/7ACBFY/2 – Nicolas Apr 26 '19 at 15:05