I want to parse feeds:

import feedparser

feed = feedparser.parse(url)
e = feed.entries[0]
summary = e['summary']

Now, when I parse the summary using BeautifulSoup:

self.summary = BeautifulSoup(summary.encode('utf-8')) #summary

I get this error:

Exception Type: UnicodeEncodeError
Exception Value: 'ascii' codec can't encode character u'\xa3' in position 755: ordinal not in range(128)

The problem is with the £ character in "£4,000". I tried:

summary.encode('utf-8','ignore'), summary.encode('ascii','ignore')

I have spent a lot of time trying to solve this, but still can't, so I am asking here.

If you could tell me which encoding supports the largest number of characters, or a way to skip that character, it would be very helpful.

Mahesh.D
suhailvs
  • utf-8 should be fine. Your error message states that it's not possible to encode the character in ascii. – pypat Jun 03 '13 at 11:21
  • Try printing `repr(summary)`. You need to know if what you have is bytes in some encoding or a `unicode` instance. If it's unicode, your `.encode()` should be working correctly, and should not be trying to use the ASCII codec. (`'\xa3'` is £ encoded in iso8859-1, so that's probably what you're starting with.) – Wooble Jun 03 '13 at 11:25
  • Could your question be a duplicate of http://stackoverflow.com/questions/4197303/ascii-codec-error-in-beautifulsoup ? – oleg Jun 03 '13 at 11:28
  • No. When I tried it in plain Python it works, but when I tried to render it as HTML in Django there is a problem. I want to update the question. – suhailvs Jun 03 '13 at 11:30
  • After using iso8859-1 I get �4,000, i.e. £ changed to �, and still the same error. – suhailvs Jun 03 '13 at 11:39
  • You shouldn't need `summary.encode` (and you shouldn't call `.encode` on a bytestring). what are `type(summary)`, `feed.encoding`, `feed.bozo`, `feed.bozo_exception`? – jfs Jun 03 '13 at 11:52
  • Sorry, everything is working fine with the encoding. utf-8 is great. Actually my problem is something else, so I am closing the question. – suhailvs Jun 03 '13 at 11:54

1 Answer

I tried loading an HTML file containing a UTF-8 pound sign into a string named "file".

This gave an error similar to the one you are seeing (a UnicodeDecodeError rather than a UnicodeEncodeError):

soup2=BeautifulSoup(file.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 17: ordinal not in range(128)

However, this seemed to work just fine:

soup2=BeautifulSoup(file.decode('utf-8'))
soup2.find('p')
<p>£
</p>

I guess the concept of "encode" and "decode" is the other way around to what you are expecting. Hope this helps.
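
To make the direction of "encode" and "decode" concrete, here is a minimal sketch. It assumes Python 3 (where `str` is Unicode text and `bytes` is encoded data), unlike the Python 2 session above, and the sample value is only illustrative.

```python
# Python 3 sketch: str is Unicode text, bytes is encoded data.
text = "£4,000"                          # Unicode string (illustrative value)

# encode: Unicode text -> bytes in a chosen character encoding
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso8859-1")

# decode: bytes -> Unicode text, using the codec the bytes were written in
round_trip = utf8_bytes.decode("utf-8")

# ASCII cannot represent the pound sign; "ignore" silently drops it
ascii_lossy = text.encode("ascii", "ignore")

# Decoding UTF-8 bytes with the ASCII codec fails on the first non-ASCII byte
try:
    utf8_bytes.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

The pound sign is one byte in iso8859-1 and two bytes in UTF-8, which is why the byte values in the two error messages in this thread differ (`\xa3` versus `0xc2`).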

Vorsprung
  • Sorry sir, .encode('utf-8') works just fine. Actually the problem is something else. I just put u in front of a string (i.e. "some text"+self.summary --> u"some text"+ self.summary) and it works magically. – suhailvs Jun 03 '13 at 12:11
  • you should not call `.encode()` on a bytestring. Python 2 is too helpful and tries to decode it to Unicode before calling `.encode()`. It is simple: Unicode string -> encode(character encoding) -> bytestring and in reverse: bytestring -> decode(character encoding) -> Unicode string. – jfs Jun 03 '13 at 12:15
  • @suhail: you were bitten by the implicit mixing of bytes and Unicode strings. Python 2 is again too helpful and tries to upgrade the bytestring to Unicode in the expression: `b"some string" + u"unicode string"`. You shouldn't mix bytes and Unicode strings. Python 3 avoids these issues by forbidding all *implicit* conversions between bytes and Unicode. – jfs Jun 03 '13 at 12:24
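
The point made in the comments about not mixing bytes and Unicode strings can be sketched as follows. This assumes Python 3, where the mix raises an error instead of being converted implicitly. The helper name `prepend_label` and the sample values are hypothetical and not from the original code.

```python
def prepend_label(label, summary_bytes, encoding="utf-8"):
    """Decode the bytes explicitly, then concatenate two Unicode strings.

    Hypothetical helper for illustration; the encoding must match the one
    the bytes were actually written in.
    """
    return label + summary_bytes.decode(encoding)


summary_bytes = "£4,000".encode("utf-8")   # stand-in for an encoded summary

# Python 3 refuses to combine str and bytes implicitly
try:
    "some text " + summary_bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Explicit decoding first keeps everything as Unicode text
combined = prepend_label("some text ", summary_bytes)
```

Decoding once at the boundary where data enters the program, and encoding once where it leaves, keeps the rest of the code working on Unicode strings only.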