
This is what the HTML looks like:

<div class="full-news none">
     Demo: <a href="https://www.lolinez.com/?https://www.makemytrip.com" 
    rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a>
   <br/>

How can I remove the https://www.lolinez.com/? prefix from the href, so that the final output looks like this:

 <div class="full-news none">
         Demo: <a href="https://www.makemytrip.com" 
        rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a>
       <br/>

I have tried using Beautiful Soup's decompose() method, but it removes the entire tag. How can this be fixed?

HedgeHog
Sainita
  • Can you expand your question and clarify why you can't just find&replace or another form of replace? – yuuuu Dec 01 '21 at 12:33
  • Why do manual find and replace buddy ? automating stuffs, thats the fun of python..i guess... – Sainita Dec 01 '21 at 12:35
  • You can automate find&replace in python. If that's a suitable solution then you can use replace() https://www.geeksforgeeks.org/python-string-replace/ – yuuuu Dec 01 '21 at 12:42
  • that i know, but how to navigate inside this html structure, then do the needed, let me know if you know the solution to it – Sainita Dec 01 '21 at 12:43
  • Without some more context on why you want to do that specifically (hence my original comment) it's hard to say, but this may work for you: https://stackoverflow.com/questions/459981/beautifulsoup-modifying-all-links-in-a-piece-of-html You can select only certain links to modify using normal bs4 techniques. – yuuuu Dec 01 '21 at 12:45

1 Answer


Note: Without additional context, I would narrow it down to the following approaches.

Option#1

Replace the substring in the string that you pass to the BeautifulSoup constructor:

soup = BeautifulSoup(YOUR_STRING.replace('https://www.lolinez.com/?',''), 'lxml')
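
As a rough illustration of Option#1, here is a minimal sketch applied to the anchor from the question. The names PREFIX and raw_html are placeholders introduced for this example (raw_html stands in for YOUR_STRING), and html.parser is swapped in for lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

PREFIX = "https://www.lolinez.com/?"  # redirect prefix from the question

# Stand-in for YOUR_STRING: the anchor from the question
raw_html = (
    '<a href="https://www.lolinez.com/?https://www.makemytrip.com" '
    'rel="external noopener noreferrer" target="_blank">'
    'https://www.makemytrip.com</a>'
)

# str.replace removes every occurrence of PREFIX in the markup,
# including any that appear in visible text, not only inside href
cleaned = raw_html.replace(PREFIX, "")

# html.parser is used here only to avoid the lxml dependency (assumption)
soup = BeautifulSoup(cleaned, "html.parser")
print(soup.a["href"])
```

Because the replacement runs on the raw string, it is the simplest choice when the prefix should disappear everywhere; if only href values should change, editing the attributes in the parsed soup is the safer route.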

Option#2

Replace the substring in your soup: select all <a> elements whose href contains www.lolinez.com and replace the value of each href:

for x in soup.select('a[href*="www.lolinez.com"]'):
    x['href'] = x['href'].replace('https://www.lolinez.com/?','')

Example

from bs4 import BeautifulSoup

html='''
<a href="https://www.lolinez.com/?https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a>
<a href="https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a>
<a href="https://www.lolinez.com/?https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a>
'''

soup = BeautifulSoup(html, 'lxml')

for x in soup.select('a[href*="www.lolinez.com"]'):
    x['href'] = x['href'].replace('https://www.lolinez.com/?','')
    
print(soup)

Output

<html><body><a href="https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a><a href="https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a><a href="https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a></body></html>
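
The <html><body> wrapper in the output above comes from the lxml parser, which builds a full document around a fragment. A minimal sketch of a variant, assuming the built-in html.parser is acceptable: it keeps the fragment from the question unwrapped and strips the prefix only when the href actually starts with it. PREFIX is a name introduced for this example, and the closing </div> is added so the snippet is self-contained.

```python
from bs4 import BeautifulSoup

PREFIX = "https://www.lolinez.com/?"  # redirect prefix from the question

# Fragment from the question, with the <div> closed for this example
html = '''<div class="full-news none">
Demo: <a href="https://www.lolinez.com/?https://www.makemytrip.com" rel="external noopener noreferrer" target="_blank">https://www.makemytrip.com</a>
<br/>
</div>'''

# html.parser does not add <html>/<body> wrappers around a fragment
soup = BeautifulSoup(html, "html.parser")

# href^="..." matches only links whose href starts with the prefix
for a in soup.select(f'a[href^="{PREFIX}"]'):
    # Slice off the leading prefix; the rest of the URL is left untouched
    a["href"] = a["href"][len(PREFIX):]

print(soup)
```

Slicing from the start of the string, instead of calling replace(), leaves any later occurrence of the prefix inside the target URL alone.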
HedgeHog