Following the solution in this thread, I have managed to get a bunch of lists that each look like:

[u'\u05ea\u05d0\u05de\u05d9\u05df \u05dc\u05d9']

I assume that those are unicode characters, but for some reason I can't convert them back into Hebrew.

I tried the suggested solution in the comments in the link. I also tried to use ''.join but it didn't work. The error I get is:

Error Type: exceptions.UnicodeEncodeError
22:42:15 T:2806414192 M:2425589760 ERROR: Error Contents: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

I also tried wrapping the values in unicode(), but the result was the same as the example above.

How do I turn these strings back into readable Hebrew?

Note:
I am trying to parse this link.

Edit:
I am trying to convert the list into a string using join and then print it. Here is the relevant piece of code:

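from BeautifulSoup import BeautifulStoneSoup

soup = BeautifulStoneSoup(link, convertEntities=BeautifulStoneSoup.XML_ENTITIES)
programs = soup('ul')
for i, prog in enumerate(programs):
    if i == (4 + getLetterValue(name)):
        j = 0
        while j < len(prog('li')):
            li = prog('li')[j]
            link = li('a')[0]
            url = link['href']
            text = link.contents
            print ''.join(text)  # this is the line that fails with UnicodeEncodeError
            j += 1  # advance to the next <li>; without this the loop never ends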

Here link is a string, and getLetterValue(name) returns an integer giving the position of the relevant list in the HTML document.

Yotam
  • What do you mean by "convert them back into Hebrew"? E.g. do you want to write them into a utf-8 encoded file? – bpgergo Aug 29 '11 at 19:51
  • That already *is* a unicode string in that list, hence the `u'...`. Please elaborate what you mean by "convert them back into Hebrew". – Ross Patterson Aug 29 '11 at 19:51
  • Can you post some code for what you are trying to do? Assigning the list above to a variable and printing it gives תאמין לי, which looks like Hebrew to me... – Fredrik Pihl Aug 29 '11 at 19:51
  • For me this prints fine `[u'\u05ea\u05d0\u05de\u05d9\u05df \u05dc\u05d9'] >>> print l[0] תאמין לי` – bpgergo Aug 29 '11 at 19:51
  • I want to display them on the screen via an xbmc.org plugin. For now, the problem is with print, which, in effect, prints the text to a file and not to the screen. – Yotam Aug 29 '11 at 19:53
  • Please include a code sample of how you use a different string. – Ross Patterson Aug 29 '11 at 19:54
  • That's not a code sample of how you'd use a different string to do what you want to do. IOW, how would you normally put a string on the screen that isn't working with this string? – Ross Patterson Aug 29 '11 at 20:00
  • @Ross Patterson: I'm not sure what you meant. The solution you wrote me doesn't work. This could originate in the way that xbmc handles strings. – Yotam Aug 29 '11 at 20:09
  • "how would you normally put a string on the screen" – Ross Patterson Aug 29 '11 at 20:11

1 Answer

This is already a unicode string containing Hebrew text, and you can print it directly in a Python interactive shell, for example:

>>> print u'\u05ea\u05d0\u05de\u05d9\u05df \u05dc\u05d9'
תאמין לי

If you really need to convert it to a raw string of bytes (a str object) for some reason, you have to specify the encoding of the byte string, because the same text can be represented in many different encodings.
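
To illustrate why the encoding has to be named, here is a small sketch (the variable names are mine, not from the question). It encodes the same text with two different codecs and gets two different byte strings. It also shows that the ASCII codec, which Python 2 typically falls back to when the output stream declares no encoding, cannot represent the Hebrew letters at all. That matches the error in the question, where positions 0-4 are the first five Hebrew letters:

```python
# -*- coding: utf-8 -*-
# The string from the question: five Hebrew letters, a space, two more letters.
text = u'\u05ea\u05d0\u05de\u05d9\u05df \u05dc\u05d9'

utf8_bytes = text.encode('utf-8')      # two bytes per Hebrew letter
cp1255_bytes = text.encode('cp1255')   # Windows Hebrew code page, one byte per letter

# ASCII has no Hebrew letters, so encoding with it raises UnicodeEncodeError.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```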

Short answer: assuming you want to use UTF-8 to encode the text, you can use:

your_unicode_text.encode('utf-8')

If you are going to use a different encoding, just change the encoding name above.
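
Applied to the code in the question, this could look roughly like the sketch below. The helper name join_and_encode is hypothetical, and the sketch assumes that link.contents holds only text nodes (unicode strings), as in the list shown in the question:

```python
# -*- coding: utf-8 -*-

def join_and_encode(parts, encoding='utf-8'):
    """Join unicode fragments and return the result as encoded bytes."""
    # Joining with a unicode separator keeps the result unicode until
    # the explicit encode step.
    return u''.join(parts).encode(encoding)

# Stand-in for link.contents from the question (assumed to be text only).
contents = [u'\u05ea\u05d0\u05de\u05d9\u05df \u05dc\u05d9']
data = join_and_encode(contents)
```

In the Python 2 code from the question, the failing line would then become `print join_and_encode(text)`. Whether XBMC then displays the bytes correctly depends on the encoding it expects, which I have not verified.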

For a reference on how Python deals with Unicode text and common problems, see: http://docs.python.org/howto/unicode.html

See also this answer for another short explanation of Unicode and string encodings.

ehabkost