
[EDITED]

I'm using Google App Engine and trying to parse HTML content to extract some information. The code I'm using is:

from google.appengine.ext import webapp
from google.appengine.ext.webapp import util
from google.appengine.api import urlfetch
import BeautifulSoup

class MainHandler(webapp.RequestHandler):
    def get(self):
        url = 'http://ascodevida.com/ultimos'
        result = urlfetch.fetch(url=url)
        # ADVs on this page.
        res = BeautifulSoup.BeautifulSoup(result.content).findAll('div', {'class' : 'box story'})
        ADVList = []
        for i in res:
            story = i.find('a', {'class' : 'advlink'}).string
            link = i.find('a', {'class' : 'advlink'})['href']
            ADVData = {
                'adv' : story,
                'link' : link
            }
            ADVList.append(ADVData)

        self.response.headers['Content-Type'] = 'text/html; charset=UTF-8'
        self.response.out.write(ADVList)

This code produces a response with strange characters. I've tried the prettify() and renderContent() methods of the BeautifulSoup library, but neither helps.

Any solutions? Thanks again.

Jesús
  • Do you mean that visiting res[0] is fine, but the output of [x in res] is strange? Could you show an example of the content? – springrider Feb 26 '12 at 14:36
  • Parsing HTML with regular expressions, or even string splitting/searching, is totally wrong. Never do it. – Odomontois Feb 26 '12 at 14:39
  • @springrider Yes. The content looks like this: "mi hermana se hab\xeda sacado su port\xe1til de casa." (It is Spanish; the strange characters are \xed = í and \xe1 = á.) – Jesús Feb 26 '12 at 15:23
  • @Odomontois What is the correct way to do this? – Jesús Feb 26 '12 at 15:25

2 Answers


I'm a Java developer and I use jsoup for HTML parsing. I found a similar library for Python; it may help you and save you time.

http://www.crummy.com/software/BeautifulSoup/

Food for thought: Python regular expression for HTML parsing (BeautifulSoup)

Dipin
  • Thanks! This is the way to parse HTML content and extract only the elements I need. – Jesús Feb 26 '12 at 14:39
  • Hmm... It still shows strange characters, but now I can get the values easily. What can I do? I've read the BeautifulSoup documentation, but it doesn't encode the document correctly... – Jesús Feb 26 '12 at 15:28
  • Can you give me the URL, or sample content from that URL? – Dipin Feb 26 '12 at 15:51
  • I'm working on localhost. I can paste all the code and the response, if you want. – Jesús Feb 26 '12 at 15:59
  • Paste the part of the response that you are trying to parse. – Dipin Feb 26 '12 at 16:02
  • I've edited the original post to show the new code I'm using. – Jesús Feb 26 '12 at 16:09
  • Can you try something like this: `story = i.find('a', {'class' : 'advlink'}).string.encode('utf8')` [Unicode in python](http://stackoverflow.com/questions/752998/how-to-work-with-unicode-in-python) – Dipin Feb 26 '12 at 18:46
  • I think you need to look at this: http://boodebr.org/main/python/all-about-python-and-unicode – Dipin Feb 26 '12 at 19:32
  • I've looked at it, but it doesn't fix my problem. Thank you so much :-) – Jesús Feb 27 '12 at 14:49

I think you are printing the list directly, which calls repr on each item; repr's default output shows non-ASCII characters as escape sequences (like \xe1).

You could try this:

>>> s = u"Leer más"
>>> repr(s)
"u'Leer m\\xe1s'"

But the print statement encodes the string for the terminal and shows the actual characters:

>>> print s
Leer más

If you want the correct result, avoid the list's default behavior and handle every item yourself.
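
As a minimal sketch of handling each item yourself (an illustration, not a confirmed fix for the code in the question): the hypothetical helper render_adv_list below formats every entry into one Unicode string, so repr is never applied to the list. The sample entry reuses the text quoted in the comments, the example.com link is made up, and HTML escaping is left out for brevity.

```python
# -*- coding: utf-8 -*-

def render_adv_list(adv_list):
    """Format each item explicitly instead of relying on repr(list).

    render_adv_list is a hypothetical helper name; it expects dicts with
    'adv' and 'link' keys holding Unicode text, as in the question's ADVList.
    """
    lines = []
    for item in adv_list:
        # Interpolating Unicode values into a Unicode template keeps the
        # real characters instead of escape sequences such as \xed.
        lines.append(u'<p><a href="%s">%s</a></p>' % (item['link'], item['adv']))
    return u'\n'.join(lines)


# Made-up sample data in the same shape as the question's ADVList.
adv_list = [
    {
        'adv': u'mi hermana se hab\xeda sacado su port\xe1til de casa.',
        'link': u'http://example.com/1',
    },
]

body = render_adv_list(adv_list)

# Encode explicitly so the bytes match the charset=UTF-8 response header.
encoded = body.encode('utf-8')
```

In the handler from the question, the result would then presumably be written with self.response.out.write(encoded) instead of self.response.out.write(ADVList).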

springrider
  • That's correct! The problem comes from the Google App Engine framework: I can't use print to show the result in the browser. I must use self.response.out.write(u"Leer más") to render the response to the browser. I've also tried to "print" the content into a variable (s = "%s" % u"Leer más"), and nothing seems to work. – Jesús Feb 27 '12 at 14:52
  • Do you mean that s = "%s" % u"Leer más" doesn't work either? I tested it and it's fine: http://mytestapp12345.appspot.com/ Did you add "#coding=utf-8" to the file header? – springrider Feb 28 '12 at 02:19
  • I've added the encoding info to the file header, but it doesn't work. I think the problem is the page I'm rendering, which declares no encoding and makes App Engine go "crazy" with strange characters. – Jesús Mar 02 '12 at 18:05