I'm trying to scrape a website using lxml.
What I'm trying to do is: fetch the HTML of the page, collect all of the stylesheet links on it, replace those links with updated ones, and then write the HTML with the updated links into a new HTML file.
The code I have so far is this:
import requests
from lxml import etree
from lxml import html
page = requests.get('https://www.flashscore.co.uk/basketball/')
root = html.fromstring(page.content)
def get_original_list():
    original_list = []
    stylesheets = root.xpath('//link')
    for link in stylesheets:
        if link.get('href'):
            if link.get('href').startswith('/'):
                original_list.append(link.get('href'))
    return original_list

def get_new_list():
    original_list = get_original_list()
    new_list = []
    for x in original_list:
        new_list.append(x.lstrip('/'))
    return new_list

def replace_links(root):
    og_list = get_original_list()
    n_list = get_new_list()
    for o, n in zip(og_list, n_list):
        print(o, n)
        get_tree = etree.tostring(root).decode()
        get_tree.replace(o, n)
        print(get_tree)

replace_links(root)
I'm stuck on replacing the links: `get_tree.replace(o, n)` doesn't seem to change anything. How can I take the page's HTML, replace the href of each link, and then save the result to a new HTML file?
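For context, here is a minimal sketch of one way to approach this: instead of replacing text in a serialized string (where `str.replace` returns a new string and leaves the original untouched), mutate the `href` attributes directly on the parsed tree with `link.set(...)` and serialize afterwards. This uses a small inline document standing in for the fetched page, so the element names and paths are illustrative, not taken from the real site:

```python
from lxml import html

# Inline stand-in for the fetched page (assumption: the live page's
# <link> elements carry root-relative hrefs like these).
sample = '''<html><head>
<link rel="stylesheet" href="/res/styles.css">
<link rel="stylesheet" href="/res/theme.css">
</head><body></body></html>'''

root = html.fromstring(sample)

# Rewrite each href in place on the tree itself; link.set mutates
# the parsed document, unlike str.replace on a serialized copy.
for link in root.xpath('//link'):
    href = link.get('href')
    if href and href.startswith('/'):
        link.set('href', href.lstrip('/'))

# Serialize the modified tree and write it to a new file.
updated = html.tostring(root).decode()
with open('updated.html', 'w') as f:
    f.write(updated)
```

The same loop would apply unchanged to the `root` parsed from `page.content`; only the source of the markup differs.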