How do I get as output the list of LINKS only? I have tried other solutions with both beautifulsoup and selennium but they still give me a result very similar to the one I am currently getting, which is the href of the link AND the anchor text. I tried to use urlparse as some older answers suggested, but it seems that that module is not in use anymore and I am confused about the whole thing. This is my code, currently outtputting the link AND the anchor text, which is NOT what I want:
import requests, re
from bs4 import BeautifulSoup
headers = {'User-agent':'Mozilla/5.0'}
page = requests.get('https://www.google.com/search?q=Tesla',headers=headers)
soup = BeautifulSoup(page.content,'lxml')
global serpUrls
serpUrls = []
links = soup.findAll('a')
for link in soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
#print(re.split(":(?=http)",link["href"].replace("/url?q=","")))
serpUrls.append(link)
print(serpUrls[0:2])
xmasRegex = re.compile(r"""((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))*))+(?:(([^\s()<>]+|(([^\s()<>]+)))*)|[^\s`!()[]{};:'".,<>?«»“”‘’]))""", re.DOTALL)
mo = xmasRegex.findall('[<a href="/url?q=https://www.teslamotors.com/&sa=U&ved=0ahUKEwjvzrTyxvTKAhXHWRoKHUjlBxwQFggUMAA&usg=AFQjCNG1nvN_Z0knKTtEah3whTIObUAhcg"><b>Tesla</b> Motors | Premium Electric Vehicles</a>, <a class="_Zkb" href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:rzPQodkDKYYJ:https://www.teslamotors.com/%252BTesla%26gws_rd%3Dcr%26hl%3Des%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjvzrTyxvTKAhXHWRoKHUjlBxwQIAgXMAA&usg=AFQjCNEZ40VWO_fFDjXH09GakUOgODNlHg">En caché</a>]')
print(mo)
I only want the "http://urloflink.com", not the whole line of code. Any way to do this? Thanks!
Output looks like this:
[<a href="/url?q=https://www.teslamotors.com/&sa=U&ved=0ahUKEwjI39vl2_TKAhXFWxoKHRX-CFgQFggUMAA&usg=AFQjCNG1nvN_Z0knKTtEah3whTIObUAhcg"><b>Tesla</b> Motors | Premium Electric Vehicles</a>, <a class="_Zkb" href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:rzPQodkDKYYJ:https://www.teslamotors.com/%252BTesla%26gws_rd%3Dcr%26hl%3Des%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjI39vl2_TKAhXFWxoKHRX-CFgQIAgXMAA&usg=AFQjCNEZ40VWO_fFDjXH09GakUOgODNlHg">En caché</a>]
[('https://www.teslamotors.com/&sa=U&ved=0ahUKEwjvzrTyxvTKAhXHWRoKHUjlBxwQFggUMAA&usg=AFQjCNG1nvN_Z0knKTtEah3whTIObUAhcg"', '', '', '', '', '', '', '', ''), ('http://webcache.googleusercontent.com/search%3Fq%3Dcache:rzPQodkDKYYJ:https://www.teslamotors.com/%252BTesla%26gws_rd%3Dcr%26hl%3Des%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjvzrTyxvTKAhXHWRoKHUjlBxwQIAgXMAA&usg=AFQjCNEZ40VWO_fFDjXH09GakUOgODNlHg"', '', '', '', '', '', '', '', '')]