How to strip this link to remove unwanted data (bs4)?

Question

This is what the HTML looks like:

<div class="full-news none">
     Demo: <a href="https://www.lolinez.com/?https://www.makemytrip.com" 
    rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a>
   <br/>

How can I remove this part from the href: https://www.lolinez.com/?, so that the final output becomes like this:

 <div class="full-news none">
         Demo: <a href="https://www.makemytrip.com" 
        rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a>
       <br/>

I tried Beautiful Soup's decompose() function, but it removes the entire tag. How can this be fixed?

Can you expand your question and clarify why you can't just find&replace or another form of replace? — yuuuu, Dec 01 '21 at 12:33
Why do a manual find and replace, buddy? Automating stuff is the fun of Python, I guess. — Sainita, Dec 01 '21 at 12:35
You can automate find&replace in python. If that's a suitable solution then you can use replace() https://www.geeksforgeeks.org/python-string-replace/ — yuuuu, Dec 01 '21 at 12:42
That I know, but how do I navigate inside this HTML structure and then make the change? Let me know if you know the solution. — Sainita, Dec 01 '21 at 12:43
Without more context on why you want to do that specifically (hence my original comment) it's hard to say, but this may work for you: https://stackoverflow.com/questions/459981/beautifulsoup-modifying-all-links-in-a-piece-of-html You can select only certain links to modify using normal bs4 techniques. — yuuuu, Dec 01 '21 at 12:45

HedgeHog · Accepted Answer · 2021-12-01T13:53:48.243

Note: Without additional context, I would narrow it down to the following approaches.

Option#1

Remove the substring from the string that you pass to the BeautifulSoup constructor:

soup = BeautifulSoup(YOUR_STRING.replace('https://www.lolinez.com/?',''), 'lxml')
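A minimal, self-contained sketch of Option#1 might look like the following. The raw_html string is a hypothetical input shaped like the markup in the question, and 'html.parser' is swapped in for 'lxml' only so the snippet runs without an extra dependency. Keep in mind that replacing in the raw string touches every occurrence of the prefix, including any that appear in visible text, not only href values.

```python
from bs4 import BeautifulSoup

# Hypothetical input, shaped like the markup in the question.
raw_html = (
    '<div class="full-news none">Demo: '
    '<a href="https://www.lolinez.com/?https://www.makemytrip.com" '
    'rel="external noopener noreferrer" target="_blank">'
    'https://www.makemytrip.com</a><br/></div>'
)

PREFIX = 'https://www.lolinez.com/?'

# Strip the redirect prefix from the raw text before it is parsed.
cleaned_html = raw_html.replace(PREFIX, '')

# 'html.parser' is an assumption made to avoid requiring lxml.
soup = BeautifulSoup(cleaned_html, 'html.parser')
link = soup.find('a')
print(link['href'])
```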

Option#2

Replace the substring in your soup: select all the <a> elements whose href contains www.lolinez.com and replace the value of each href:

for x in soup.select('a[href*="www.lolinez.com"]'):
    x['href'] = x['href'].replace('https://www.lolinez.com/?','')

Example

from bs4 import BeautifulSoup

html='''
<a href="https://www.lolinez.com/?https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a>
<a href="https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a>
<a href="https://www.lolinez.com/?https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a>
'''

soup = BeautifulSoup(html, 'lxml')

for x in soup.select('a[href*="www.lolinez.com"]'):
    x['href'] = x['href'].replace('https://www.lolinez.com/?','')
    
print(soup)

Output

<html><body><a href="https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a><a href="https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a><a href="https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a></body></html>
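If the prefix should only be removed when it appears at the very start of an href, a slightly stricter variant could be sketched as below. The helper name strip_redirect_prefix is hypothetical, the shortened sample markup is made up for illustration, and 'html.parser' is used instead of 'lxml' only to keep the snippet dependency-free. The a[href^=...] selector matches hrefs that begin with the given value, whereas the *= selector used above matches the value anywhere in the attribute.

```python
from bs4 import BeautifulSoup

PREFIX = 'https://www.lolinez.com/?'


def strip_redirect_prefix(url, prefix=PREFIX):
    """Return url without the leading prefix; leave other URLs unchanged."""
    if url.startswith(prefix):
        return url[len(prefix):]
    return url


# Hypothetical sample: one wrapped link and one plain link.
html = '''
<a href="https://www.lolinez.com/?https://www.makemytrip.com">wrapped</a>
<a href="https://www.makemytrip.com">plain</a>
'''

# 'html.parser' is an assumption made to avoid requiring lxml.
soup = BeautifulSoup(html, 'html.parser')

# Select only anchors whose href starts with the redirect prefix.
for a in soup.select(f'a[href^="{PREFIX}"]'):
    a['href'] = strip_redirect_prefix(a['href'])

hrefs = [a['href'] for a in soup.find_all('a')]
print(hrefs)
```

Slicing off the prefix, rather than calling replace(), leaves the rest of the URL untouched even if the same text happens to occur again later in it.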
