I have the following Python code:

content = webpage.content
soup = Soup(content, 'html.parser')
app_url = scheme + app_identity.get_default_version_hostname() + '/'

for link in soup.find_all(href = True):
    if scheme in link['href']:
        link['href'] = link['href'].replace(scheme, app_url)
        logging.info('@MirrorPage | Updated link: %s', link['href'])
    else:
        link['href'] = input_url + link['href'].strip('/')
        logging.info('@MirrorPage | Updated asset: %s', link['href'])

# https://stackoverflow.com/questions/15455148/find-after-replacewith-doesnt-work-using-beautifulsoup/19612218#19612218
#soup = Soup(soup.renderContents())

# https://stackoverflow.com/questions/14369447/how-to-save-back-changes-made-to-a-html-file-using-beautifulsoup-in-python
content = soup.prettify(soup.original_encoding)

and render my HTML like so:

self.response.write(Environment().from_string(unicode(content, errors = 'ignore')).render())

Here app_identity comes from Google App Engine, and jinja2 handles templating and rendering. I've tried everything I can think of to write the modified HTML back to the content variable so that the updated page is rendered. How can I write my changes back properly? I have tried using replaceWith where it seemed appropriate, but that doesn't do the trick. Am I doing something fundamentally wrong?
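
For reference, assigning to link['href'] already edits the parsed tree in place, so serializing the soup afterwards should reflect the change without any replaceWith call. A minimal sketch, assuming bs4 is installed; the HTML snippet, the value of `scheme`, and `https://mirror.example/` (standing in for `app_url`) are all made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical input; in the question this would be webpage.content.
html = '<html><body><a href="https://example.com/page">link</a></body></html>'
scheme = 'https://'                  # assumed value of the question's `scheme`
app_url = 'https://mirror.example/'  # hypothetical stand-in for the App Engine host

soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all(href=True):
    if scheme in link['href']:
        # Attribute assignment modifies the tree in place.
        link['href'] = link['href'].replace(scheme, app_url)

# Serialize the modified tree back to a string.
content = str(soup)
print(content)
```

If the serialized string contains the rewritten link, the remaining problem is more likely in how the page is fetched or rendered than in writing the changes back.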

T145

2 Answers

This function renders the modified HTML and returns it so that it can be reprocessed as needed.

I tested it on stackoverflow.com, and the output contained the replaced links/scheme.

I used {{description}} as the placeholder in template.html.

The rendered HTML is returned as a variable, then passed back into a bs4 object and printed.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
from xml.sax.saxutils import escape
import os

import jinja2
import requests
from bs4 import BeautifulSoup as bs4


def revise_links():
    url = 'https://stackoverflow.com/'
    template_name = 'template.html'
    file_name = 'replaced'

    scheme = 'stackoverflow'
    replace_with = 'mysite'

    r = requests.get(url)
    html_bytes = r.text
    soup = bs4(html_bytes, 'lxml')

    for a in soup.findAll(href=True):
        if scheme in a['href']:
            a['href'] = a['href'].replace(scheme, replace_with)
            print(a['href'])
        else:
            a['href'] = url + a['href'].strip('/')

    # Serialize the modified tree once, after all links have been rewritten
    description_source = soup.prettify()

    # RENDER THE NEW HTML FILE    *

    def render(tpl_path, context):
        """Render html file with new data. Looks for the file in the current path"""
        (path, filename) = os.path.split(tpl_path)
        return jinja2.Environment(loader=jinja2.FileSystemLoader(path or './')).get_template(filename).render(context)

    # HTML DATA

    context = {'description': description_source}

    # Render the result

    result = render(template_name, context)

    # open the html

    # with open(file_name + '.html', 'a', encoding='utf-8') as f:
    #      f.write(result)  # write result

    # OPEN THE NEW HTML FILE READY TO REVISE **********************

    #  f1 = open(file_name + '.html', 'r', encoding='utf-8')
     # descript = f1.read()

    return result


content = revise_links()
soup = bs4(content, 'lxml')
print(soup)
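
The else branch in the code above builds URLs by string concatenation, so an absolute link to another site (for example https://example.com/a) would end up appended to the base URL. A possible alternative, offered only as a sketch, is the standard library's urljoin, which resolves relative paths and leaves absolute URLs unchanged. The helper name `absolutize` is hypothetical:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

base_url = 'https://stackoverflow.com/'  # same value as `url` in the answer


def absolutize(href, base=base_url):
    """Resolve href against base; absolute URLs pass through unchanged."""
    return urljoin(base, href)


print(absolutize('/questions'))
print(absolutize('tags?tab=name'))
print(absolutize('https://example.com/a'))
```

Inside the loop, this would replace the concatenation with something like a['href'] = absolutize(a['href']).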
johnashu
  • According to the BeautifulSoup documentation, you do not need to specify a tag: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree – T145 Feb 11 '18 at 00:41
  • `with open('mirror.html', 'a') as f: IOError: [Errno 30] Read-only file system: 'mirror.html'` – T145 Feb 11 '18 at 01:13
  • You need write permissions wherever you are trying to create the file. If you are using Google App Engine, see the following and modify the function with the Google code for reading and writing: https://cloud.google.com/appengine/docs/standard/python/googlecloudstorageclient/read-write-to-cloud-storage – johnashu Feb 11 '18 at 01:19
  • As I said in the post you deleted before, that's what I've been looking at. In my current environment `cloudstorage` cannot be used. – T145 Feb 11 '18 at 01:22
  • On the system you use, you don't have write access? You can only keep it in memory? – johnashu Feb 11 '18 at 09:24
  • Yes, it is a `read-only` environment by default. I just figured out how to change the roles for the `service-account` that manages the whole app, and changed its permissions to hopefully allow write access. We'll see if something happens. https://cloud.google.com/appengine/docs/standard/python/access-control – T145 Feb 11 '18 at 21:24
  • I ran the code without the read and write functions. As long as you can read template.html, it returns the result as a variable. I edited the answer so you can see. – johnashu Feb 11 '18 at 21:26
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/164934/discussion-between-johnashu-and-t145). – johnashu Feb 11 '18 at 21:33

Changing the permissions for the service account under the IAM settings of the Google App Engine project fixed writing changes. However, the base HTML is not rendering the full page; for example, when rendering a site like Google, the JavaScript and styles don't seem to work. I can render the HTML simply using self.response.write(soup), but that doesn't solve this problem. I'll address this issue in a separate question, as it involves actually retrieving (or scraping) the specified website.
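
One possible contributor to the missing JavaScript, though nothing here confirms it: the loop in the question only visits tags that carry an href attribute, while script and img tags reference their resources through src. A hedged sketch that rewrites both attributes, assuming bs4 is installed; the HTML snippet and `page_url` are made up for illustration:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

from bs4 import BeautifulSoup

page_url = 'https://example.com/docs/'  # hypothetical URL of the mirrored page
html = ('<html><head><link href="style.css" rel="stylesheet"/>'
        '<script src="/static/app.js"></script></head><body></body></html>')

soup = BeautifulSoup(html, 'html.parser')

# Resolve every href and src against the page URL.
for attr in ('href', 'src'):
    for tag in soup.find_all(attrs={attr: True}):
        tag[attr] = urljoin(page_url, tag[attr])

content = str(soup)
print(content)
```

Resources that a page loads dynamically from its own scripts would not be covered by this kind of rewriting.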

T145