Which encoding standard should I use so that it supports the largest number of characters?

Question

I want to parse feeds:

import feedparser

feed = feedparser.parse(url)
e = feed.entries[0]
summary = e['summary']

Now, when I parse the summary using BeautifulSoup:

self.summary = BeautifulSoup(summary.encode('utf-8')) #summary

I get an error:

Exception Type: UnicodeEncodeError
Exception Value: 'ascii' codec can't encode character u'\xa3' in position 755: ordinal not in range(128)

The problem is with the £ character in "£4,000". I tried:

summary.encode('utf-8','ignore'), summary.encode('ascii','ignore')

I have spent a lot of time trying to solve this, but still can't, so I am asking here.

If you can tell me which encoding supports the largest number of characters, or any method of skipping that character, it would be very helpful.

utf-8 should be fine. Your error message states that it's not possible to encode the character in ascii. — pypat, Jun 03 '13 at 11:21
Try printing `repr(summary)`. You need to know if what you have is bytes in some encoding or a `unicode` instance. If it's unicode, your `.encode()` should be working correctly, and should not be trying to use the ASCII codec. (`'\xa3'` is £ encoded in iso8859-1, so that's probably what you're starting with.) — Wooble, Jun 03 '13 at 11:25
Could your question be a duplicate of http://stackoverflow.com/questions/4197303/ascii-codec-error-in-beautifulsoup ? — oleg, Jun 03 '13 at 11:28
No. I tried it in plain Python and it works, but when I try to render it in Django as HTML there is a problem. I want to update the question. — suhailvs, Jun 03 '13 at 11:30
After using iso8859-1 I get �4,000, i.e. £ changed to �; still the same error. — suhailvs, Jun 03 '13 at 11:39
You shouldn't need `summary.encode` (and you shouldn't call `.encode` on a bytestring). what are `type(summary)`, `feed.encoding`, `feed.bozo`, `feed.bozo_exception`? — jfs, Jun 03 '13 at 11:52
Sorry, everything is working fine with the encoding. utf-8 is great. Actually my problem is something else, so I am closing the question. — suhailvs, Jun 03 '13 at 11:54

score 1 · Accepted Answer · answered Jun 03 '13 at 12:03

I tried loading an HTML file containing a UTF-8 encoded pound sign into a string, "file".

This gave an error similar to the one you are seeing:

soup2=BeautifulSoup(file.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 17: ordinal not in range(128)

However, this seemed to work just fine:

soup2=BeautifulSoup(file.decode('utf-8'))
soup2.find('p')
<p>£
</p>

I guess "encode" and "decode" work the other way around from what you are expecting. Hope this helps.
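As a rough illustration of which direction each call goes, here is a minimal sketch. It assumes Python 3, where text (`str`) and bytes are separate types, rather than the Python 2 used in the question; the variable names are only for illustration.

```python
# Python 3 sketch (assumption: the thread itself uses Python 2).
text = "\u00a34,000"                 # Unicode text: "£4,000"

# encode: text -> bytes
data = text.encode("utf-8")          # the pound sign becomes the two bytes 0xc2 0xa3

# decode: bytes -> text
round_trip = data.decode("utf-8")

# ASCII has no pound sign, so a strict encode fails ...
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# ... while 'ignore' silently drops the character.
skipped = text.encode("ascii", "ignore")
```

UTF-8 can encode every Unicode code point, so it is a reasonable default; the `'ignore'` handler only matters for encodings such as ASCII that cannot represent the character at all.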

Vorsprung

Sorry sir, .encode('utf-8') works just fine. Actually the problem was something else. I just put u in front of a string (i.e. "some text"+self.summary --> u"some text"+ self.summary) and it works magically. – suhailvs Jun 03 '13 at 12:11
you should not call `.encode()` on a bytestring. Python 2 is too helpful and tries to decode it to Unicode before calling `.encode()`. It is simple: Unicode string -> encode(character encoding) -> bytestring and in reverse: bytestring -> decode(character encoding) -> Unicode string. – jfs Jun 03 '13 at 12:15
@suhail: you were bitten by the implicit mixing of bytes and Unicode strings. Python 2 is again too helpful and tries to upgrade the bytestring to Unicode in the expression: `b"some string" + u"unicode string"`. You shouldn't mix bytes and Unicode strings. Python 3 avoids these issues by forbidding all *implicit* conversions between bytes and Unicode. – jfs Jun 03 '13 at 12:24
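The last point can be checked with a short sketch, assuming Python 3 (the names `prefix` and `summary` are hypothetical stand-ins for the strings in the question). Mixing bytes and text raises an error instead of triggering a hidden ASCII decode, so the conversion has to be written out:

```python
# Python 3 sketch: bytes and str never mix implicitly.
prefix = b"some text "           # a bytestring
summary = "\u00a34,000"          # Unicode text: "£4,000"

mixing_error = None
try:
    prefix + summary             # bytes + str is refused outright
except TypeError as exc:
    mixing_error = exc

# Decode the bytes explicitly, then concatenate text with text.
combined = prefix.decode("utf-8") + summary
```

In Python 2 the equivalent fix is the one described above: make both operands Unicode (for example with the `u` prefix) before concatenating.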
