
I am trying to make an offline copy of this website: ieeghn. Part of this task is to use Beautiful Soup to find every CSS/JS file the page refers to, download it, and rewrite each external link so that it points to the downloaded copy.

At the moment I simply use the string replace method, but I don't think this is effective, since I do it inside a loop. Snippet below:

local_content = ''
for res in soup.findAll('link', {'rel': 'stylesheet'}):
    if not str(res['href']).startswith('data:'):
        original_res = res['href']
        res['href'] = some_function_to_download_css()
        local_content = local_content.replace(original_res, res['href'])

I only download resources that are not embedded, i.e. whose href does not start with data:. The problem is that local_content = local_content.replace(original_res, res['href']) only manages to turn one external resource into a local one. The rest still refer to the online version of the resource.

I am guessing that this fails because local_content is a very long string (have a look at the ieeghn source).

How do you properly replace content of a string for a given pattern? Or do I have to store this first to a file and modify it there?

EDIT: I found that the problem was in this line of code:

 original_res = res['href']

Beautiful Soup somehow sanitizes the href string. In my case, &amp; is changed to &. Since I am trying to replace the original href with the path of a newly downloaded local file, str.replace() simply won't find the original value. Either I have to find a way to get the original href, or I have to handle this case myself. I have to say, getting the original href would be the best way.
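The mismatch described in the edit can be reproduced with the standard library alone. The sketch below is only an illustration: `raw_page`, `parsed_href`, and `local_path` are made-up values, and it assumes the raw source writes the ampersand as `&amp;` (other encodings such as `&#38;` would not be caught by re-escaping).

```python
import html

# Hypothetical raw page source: the href is written with an HTML entity.
raw_page = '<link rel="stylesheet" href="style.css?v=1&amp;theme=dark">'

# What a parser such as Beautiful Soup returns for res['href']:
# the entity has already been decoded.
parsed_href = "style.css?v=1&theme=dark"

# Hypothetical local path returned by the download step.
local_path = "local/style.css"

# The decoded value does not occur in the raw text, so replace() changes nothing.
unchanged = raw_page.replace(parsed_href, local_path)

# Re-escaping the value restores the form used in the raw text.
escaped_href = html.escape(parsed_href, quote=False)
rewritten = raw_page.replace(escaped_href, local_path)

print(unchanged == raw_page)
print(rewritten)
```

Re-escaping only works when the page uses the same entity form that `html.escape` produces, so it is a workaround rather than a general fix.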

swdev
  • Could you provide almost the full functioning code which works? So that I can have a look... – wenzul Oct 25 '14 at 00:49

1 Answer


You're already replacing the content, in a way...

res['href'] = some_function_to_download_css()

...updates the href attribute of the res node in BeautifulSoup's representation of the HTML tree.

To make it more efficient, you could cache the URLs of CSS files you've already downloaded, and consult the cache before downloading the file. Once you're done (and if you're OK with BS's attribute ordering/indentation/etc.), you can get the string representation of the tree with str(soup).
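The caching idea might look roughly like the sketch below. The names `localize` and `fake_download` are hypothetical, and the sketch assumes the download step takes the URL and returns a local path (the question's `some_function_to_download_css()` is shown without arguments, so that signature is a guess).

```python
from typing import Callable, Dict


def localize(url: str, cache: Dict[str, str], download: Callable[[str], str]) -> str:
    """Return the local path for url, downloading only the first time it is seen."""
    if url not in cache:
        cache[url] = download(url)
    return cache[url]


# Stand-in for the real download function: records each call and
# returns a made-up local path derived from the file name.
calls = []


def fake_download(url: str) -> str:
    calls.append(url)
    return "local/" + url.rsplit("/", 1)[-1]


cache: Dict[str, str] = {}
hrefs = [
    "http://example.com/a.css",
    "http://example.com/b.css",
    "http://example.com/a.css",  # repeated: served from the cache
]
local_paths = [localize(h, cache, fake_download) for h in hrefs]

print(local_paths)
print(calls)
```

Inside the question's loop this would presumably become `res['href'] = localize(res['href'], cache, ...)`, with `str(soup)` called once after the loop to get the rewritten page.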

Reference: http://beautiful-soup-4.readthedocs.org/en/latest/#changing-tag-names-and-attributes

PlasmaSauna
  • Actually, I have already done just that and realized it. It's just that somehow, for a long string (of HTML content), local_content won't change. I think I will post the full code so that it can be tested. It may also help me track down the bug – swdev Oct 26 '14 at 15:52
  • Hi @PlasmaSauna: I edited my question. Any suggestion? – swdev Oct 27 '14 at 22:38
  • @swdev have you tried the HTMLParser class's `unescape` method? http://stackoverflow.com/a/12614706 – PlasmaSauna Nov 15 '14 at 03:02
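In Python 3 the counterpart of that method is `html.unescape` (the `HTMLParser.unescape` method was removed in Python 3.9). One way to use it, sketched here under assumptions, is to decode each href found in the raw text before looking it up, so that it compares equal to the value the parser returned. The function `rewrite_hrefs` and the `replacements` mapping are hypothetical, and the regular expression only handles quoted `href="..."` attributes.

```python
import html
import re


def rewrite_hrefs(raw_page, replacements):
    """Rewrite href values in raw HTML, matching on their decoded form.

    replacements maps a decoded href (as a parser would return it)
    to the local path that should replace it.
    """
    def swap(match):
        quote, value = match.group(1), match.group(2)
        decoded = html.unescape(value)  # "&amp;" becomes "&", and so on
        new_value = replacements.get(decoded)
        if new_value is None:
            return match.group(0)  # unknown href: leave it untouched
        return "href=" + quote + html.escape(new_value, quote=True) + quote

    # Matches href="..." or href='...' with the same quote on both sides.
    return re.sub(r'href=(["\'])(.*?)\1', swap, raw_page)


raw = '<link href="a.css?x=1&amp;y=2"><link href="b.css">'
result = rewrite_hrefs(raw, {"a.css?x=1&y=2": "local/a.css"})
print(result)
```

Pattern matching on raw HTML is fragile, so modifying the parsed tree and serializing it is likely the more robust route when the exact original formatting does not matter.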