I am trying to make an offline copy of this website: ieeghn. Part of this task is to download all the CSS/JS that the page refers to, using Beautiful Soup, and to rewrite every external link so that it points to the newly downloaded resource.
At the moment I simply use the string replace method. But I don't think this is effective, as I do it inside a loop; snippet below:
local_content = ''  # placeholder; in the real script this holds the page's HTML source
for res in soup.findAll('link', {'rel': 'stylesheet'}):
    # skip embedded resources (data: URIs)
    if not str(res['href']).startswith('data:'):
        original_res = res['href']
        res['href'] = some_function_to_download_css()
        local_content = local_content.replace(original_res, res['href'])
I only download resources that are not embedded, i.e. whose href does not start with data:. The problem is that local_content = local_content.replace(original_res, res['href']) only manages to rewrite one external resource to its local copy. The rest still refer to the online version of the resource.
I am guessing that because local_content is a very long string (have a look at the ieeghn source), this didn't work out well.
How do you properly replace the content of a string for a given pattern? Or do I have to store it in a file first and modify it there?
EDIT: I found that the problem was in this line of code:
original_res = res['href']
Beautiful Soup somehow sanitizes the href string. In my case, &amp; is changed to &. Since I am trying to replace the original href with the path of the newly downloaded local file, str.replace() simply won't find the original value. Either I have to find a way to get the original href, or I have to handle this case explicitly. I have to say, having the original href would be the best way.
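To illustrate the mismatch described in the edit, here is a minimal sketch of the "handle this case" option using only the standard library. It assumes the only difference between the parsed href and the raw source is entity escaping; replace_href and the sample URL are hypothetical names made up for this example, not part of Beautiful Soup.

```python
import html


def replace_href(source: str, parsed_href: str, local_path: str) -> str:
    """Replace an href in raw HTML, trying the entity-escaped spelling first.

    Hypothetical helper: it assumes the parsed value differs from the raw
    source only by entity escaping (e.g. "&" versus "&amp;").
    """
    # The parser returns the decoded value ("&"), while the raw source
    # may still contain the escaped form ("&amp;").
    escaped_href = html.escape(parsed_href, quote=False)
    for candidate in (escaped_href, parsed_href):
        if candidate in source:
            return source.replace(candidate, local_path)
    return source  # nothing matched; leave the source untouched


raw = '<link rel="stylesheet" href="style.css?a=1&amp;b=2">'
result = replace_href(raw, 'style.css?a=1&b=2', 'local/style.css')
print(result)
```

Another option that avoids the problem altogether might be to skip the string replacement: since res['href'] is already updated on the parsed tree, writing out str(soup) instead of local_content should give a document that contains the new paths.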