167

This will surely be an easy one but it is really bugging me.

I have a script that reads in a webpage and uses Beautiful Soup to parse it. From the soup I extract all the links as my final goal is to print out the link.contents.

All of the text that I am parsing is ASCII. I know that Python treats strings as unicode, and I am sure this is very handy, just of no use in my wee script.

Every time I go to print out a variable that holds 'String' I get [u'String'] printed to the screen. Is there a simple way of getting this back into just ascii or should I write a regex to strip it?

Freek de Bruijn
gnuchu
  • possible duplicate of the much more clearly worded question (and answer): https://stackoverflow.com/q/2464959/1390788 – Terrabits Jun 21 '20 at 23:09
  • Does this answer your question? [What's the u prefix in a Python string?](https://stackoverflow.com/questions/2464959/whats-the-u-prefix-in-a-python-string) – Terrabits Jun 21 '20 at 23:10

9 Answers

130

[u'ABC'] would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.

I don't know exactly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your text is really only ASCII, you would use this:

 soup[0].encode("ascii")

However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it's latin-1 or utf-8.

 soup[0].encode("latin-1")


 soup[0].encode("utf-8")

Or you ask Beautiful Soup what the original encoding was and get it back in this encoding (the attribute is originalEncoding in Beautiful Soup 3 and original_encoding in Beautiful Soup 4):

 soup[0].encode(soup.originalEncoding)
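
To see how the three codecs above differ, here is a minimal sketch. The sample string `text` and the `encode_with` helper are hypothetical and not from the question; the string deliberately contains one non-ASCII character so the failure mode is visible. It is written for Python 3, where `u''` literals are still accepted.

```python
# Hypothetical sample: "cafe" with an accented e, written as an escape
# so the source file itself stays ASCII.
text = u"caf\u00e9"


def encode_with(value, codec):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return value.encode(codec)
    except UnicodeEncodeError:
        return None


codecs_to_try = ("ascii", "latin-1", "utf-8")
results = {codec: encode_with(text, codec) for codec in codecs_to_try}

for codec in codecs_to_try:
    print(codec, repr(results[codec]))
```

The ASCII attempt fails outright, while latin-1 and utf-8 both succeed but produce different bytes, which is why it matters to know what the consumer of the bytes expects.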
oefe
  • You actually don't have to do the encoding, because the OP is only seeing the string repr, which is how any item appears when you print a list. soup[0] will be enough to show the str instead of the repr, showing the contents of the string and not the quotes and unicode modifier. – ironfroggy Mar 01 '09 at 13:36
  • You shouldn't encode the text represented as Unicode to bytes in most cases: you should print Unicode directly in Python: [`print(', '.join([u'ABC' , u'...']))`](http://stackoverflow.com/a/36891685/4279) – jfs Jun 12 '16 at 17:20
27

You probably have a list containing one unicode string. The repr of this is [u'String'].

You can convert this to a list of byte strings using any variation of the following:

# Functional style.
print map(lambda x: x.encode('ascii'), my_list)

# List comprehension.
print [x.encode('ascii') for x in my_list]

# Preserves the container type, e.g. if my_list may be a tuple.
print type(my_list)(x.encode('ascii') for x in my_list)

# What do I care about the brackets anyway?
print ', '.join(repr(x.encode('ascii')) for x in my_list)

# That's actually not a good way of doing it.
print ' '.join(repr(x).lstrip('u')[1:-1] for x in my_list)
ddaa
  • Please, avoid such horrors as `repr(x).lstrip('u')[1:-1]`. Use something like: `print ", ".join(my_list)` instead, to format a list of Unicode strings. – jfs Apr 27 '16 at 13:46
  • The comment, it says: "That's actually not a good way of doing it". It's just here for the lolz! – ddaa Apr 27 '16 at 14:54
13
import json, ast
r = {u'name': u'A', u'primary_key': 1}
ast.literal_eval(json.dumps(r)) 

will print

{'name': 'A', 'primary_key': 1}
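
One caveat, shown as a hedged sketch: this round trip only works when the JSON text also happens to be a valid Python literal. That holds for strings, numbers, lists, and dicts, but JSON spells booleans and null as `true`, `false`, and `null`, which `ast.literal_eval` rejects. The `round_trip` helper and the sample dicts below are hypothetical.

```python
import ast
import json


def round_trip(obj):
    """Serialize to JSON text, then parse that text as a Python literal."""
    return ast.literal_eval(json.dumps(obj))


# Strings and numbers survive the trip.
plain = round_trip({u"name": u"A", u"primary_key": 1})

# JSON writes True as `true`, which is not a Python literal.
try:
    round_trip({u"active": True})
    failed = False
except ValueError:
    failed = True

print(plain, failed)
```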
osmjit
10

If accessing/printing single element lists (e.g., sequentially or filtered):

my_list = [u'String'] # sample element
my_list = [str(my_list[0])]
gevang
5

Pass the value to the str() function and it will drop the unicode u'' prefix. Printing the string itself also shows it without the u'' prefix.

waweru
4

[u'String'] is a text representation of a list that contains a Unicode string on Python 2.

If you run print(some_list) then it is equivalent to
print '[%s]' % ', '.join(map(repr, some_list)), i.e., to create a text representation of a Python object with the type list, the repr() function is called for each item.
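
The equivalence can be checked directly. This is a minimal sketch with a made-up `some_list`; it is written for Python 3, where the `u` prefix no longer appears in the repr, but the mechanism is the same.

```python
some_list = [u"String", u"ABC"]  # hypothetical sample data

# What print(some_list) writes out: the str() of the list.
printed = str(some_list)

# The same text rebuilt by hand from the repr() of each item.
rebuilt = "[%s]" % ", ".join(map(repr, some_list))

# Joining the items themselves skips repr(), so no quotes or prefixes appear.
joined = ", ".join(some_list)

print(printed)
print(rebuilt)
print(joined)
```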

Don't confuse a Python object and its text representation: repr('a') != 'a', and even the text representation of the text representation differs: repr(repr('a')) != repr('a').

repr(obj) returns a string that contains a printable representation of an object. Its purpose is to be an unambiguous representation of an object that can be useful for debugging, in a REPL. Often eval(repr(obj)) == obj.
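
A small sketch of these properties, using made-up sample values. The round trip through eval() is only shown for simple built-in types, where it is known to hold.

```python
obj = u"a"

text = repr(obj)           # the representation includes quote characters
text_of_text = repr(text)  # repr of the repr adds another layer of quoting

# The object, its representation, and the representation of that all differ.
all_different = len({obj, text, text_of_text}) == 3

# For simple built-in values, evaluating the representation
# gives back an object equal to the original.
samples = [u"a", 42, [1, 2], {"k": (1, 2)}]
round_trips = all(eval(repr(sample)) == sample for sample in samples)

print(text, text_of_text, all_different, round_trips)
```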

To avoid calling repr(), you could print list items directly (if they are all Unicode strings), e.g.: print ",".join(some_list), which prints a comma-separated list of the strings: String

Do not encode a Unicode string to bytes using a hardcoded character encoding, print Unicode directly instead. Otherwise, the code may fail because the encoding can't represent all the characters e.g., if you try to use 'ascii' encoding with non-ascii characters. Or the code silently produces mojibake (corrupted data is passed further in a pipeline) if the environment uses an encoding that is incompatible with the hardcoded encoding.
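
The silent-corruption case is easy to reproduce. In this sketch the sample string is hypothetical, and the two encodings are just one example of an incompatible pair.

```python
text = u"caf\u00e9"  # hypothetical sample: "cafe" with an accented e

# Bytes written with one hardcoded encoding...
data = text.encode("utf-8")

# ...and read back by an environment that assumes a different one.
garbled = data.decode("latin-1")

# No exception is raised, yet the text is no longer what was written.
print(repr(text), repr(garbled), garbled == text)
```

Because nothing fails here, the damaged string can travel through the rest of a pipeline unnoticed.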

jfs
4

Do you really mean u'String'?

In any event, can't you just do str(string) to get a string rather than a unicode-string? (This should be different for Python 3, for which all strings are unicode.)

hichris123
Andrew Jaffe
  • I should have been clearer. I am using str() but still getting output like below when I print. [u'ABC'] [u'DEF'] [u'GHI'] [u'JKL'] The data is stripped as text from a webpage, then inserted into a database (Google Appstore), then retrieved and printed. – gnuchu Mar 01 '09 at 11:09
3

Use dir or type on the 'string' to find out what it is. I suspect that it's one of BeautifulSoup's tag objects, which prints like a string but really isn't one. Otherwise, it's inside a list and you need to convert each string separately.
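
A minimal sketch of that inspection, using a plain list as a stand-in for whatever the script actually holds. The `value` sample and the `describe` helper are hypothetical. On Python 3 the inner type is reported as `str`; on Python 2 it would be `unicode`.

```python
def describe(obj):
    """Return the type name of obj and, for lists and tuples, of each item inside it."""
    name = type(obj).__name__
    if isinstance(obj, (list, tuple)):
        return name, [type(item).__name__ for item in obj]
    return name, []


value = [u"String"]  # stand-in for the variable being printed

outer, inner = describe(value)
print(outer, inner)

# dir() lists the attributes of an object, e.g. whether it has encode().
print("encode" in dir(value), "encode" in dir(value[0]))
```

If the outer type turns out to be a list, indexing into it (or joining its items) gives the strings themselves.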

In any case, why are you objecting to using Unicode? Any specific reason?

sykora
  • I've been looking at BeautifulSoup for the last few days. I couldn't figure out how gnuchu would get u['string'] not [u'String']. His comment to Andrew Jaffe seems to prove it is a list. – batbrat Mar 01 '09 at 11:54
-3

encode("latin-1") helped me in my case:

facultyname[0].encode("latin-1")
Undo