
I scraped a webpage with BeautifulSoup. The output is great, except that parts of the list look like this after getting the text:

list = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']

My question now is how to get rid of these double backslashes, or replace the escape sequences with the special characters they stand for.

If I print the first element of the example list, the output looks like this:

print list[0]
that\u2019s

I have already read a lot of other questions and threads about this topic, but I ended up even more confused, as I am a beginner when it comes to Unicode, encoding, and decoding.

I hope that someone could help me with this issue.

Thanks! MG

mgruber
  • @mgruber remember to accept an answer if it helped you – eLRuLL Jan 04 '17 at 17:17
  • Unless the web page literally contains unicode escape sequences like that (*that\u2019s* instead of *that’s*), beautifulsoup will not return strings in that form. It will return the text without escaping anything. How are you getting those strings? – roeland Jan 04 '17 at 20:04
  • I applied a regex at the same time, and it seems that this was the problem. Do you have an offhand explanation for that? – mgruber Jan 05 '17 at 08:35
  • Have you scraped sub-parts of a JSON structure? If so you should instead try to read the whole JSON value, parse it using `json.loads` and access the pieces of it you want from there. – bobince Jan 05 '17 at 11:02
  • I did access them by first loading the file with `data = json.load(name_of_file)` and then taking only the part I want with `raw = data['html']`. I assume that the next step, where I tried to get rid of comments (some were still left after using BeautifulSoup in some cases) with `raw = re.sub('(?s)', '',str(raw))`, made my output messy. – mgruber Jan 05 '17 at 13:17
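The JSON suggestion above can be sketched as follows (Python 3 syntax). The `html` key mirrors the comment; the sample payload is made up for illustration. `json.loads` turns `\u2019` escapes in the JSON source into real characters, so no extra unescaping should be needed as long as the value is not re-serialized or otherwise converted afterwards:

```python
import json

# Raw JSON text as it might sit in the file: the escape is part of the JSON syntax.
payload = '{"html": "<p>that\\u2019s it</p>"}'

data = json.loads(payload)
raw = data['html']  # a normal string containing the real U+2019 character

# Re-serializing with the default ensure_ascii=True brings the escapes back.
# This is one possible way literal \u2019 sequences end up in scraped text.
reserialized = json.dumps(raw)
```

If literal `\u2019` sequences still show up after this step, the text was probably escaped a second time somewhere between loading and printing.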

2 Answers


Since you are using Python 2 there, it is simply a matter of re-applying the "decode" method, using the special codec "unicode_escape". It "sees" the "physical" backslashes and decodes those sequences into proper Unicode characters:

data =  [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']

result = [part.decode('unicode_escape') for part in data]

To anyone getting here using Python 3: in that version one cannot apply the "decode" method to the str objects delivered by BeautifulSoup. One has to first re-encode those to byte-string objects, and then decode with the unicode_escape codec. For this purpose it is useful to use the latin1 codec as a transparent encoding: every code point up to U+00FF in the str object is preserved as the byte with the same value in the new bytes object (a character outside that range makes the encode step fail):

result = [part.encode('latin1').decode('unicode_escape') for part in data]
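For readers on Python 3, here is a minimal runnable sketch of the round trip described above. Because the `latin1` step raises `UnicodeEncodeError` when a string already contains characters outside Latin-1, the sketch also shows a regex-based alternative. The helper name `unescape_unicode` is hypothetical (it is not part of any library), and it only handles four-digit `\uXXXX` escapes:

```python
import re

data = ['that\\u2019s', 'it\\u2019ll', '\\u2013']

# Round trip from the answer: safe only when every character fits in Latin-1.
result = [part.encode('latin1').decode('unicode_escape') for part in data]

# Hypothetical helper: replaces literal \uXXXX sequences and leaves every
# other character (including non-Latin-1 ones) untouched.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def unescape_unicode(text):
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# This string already contains a real euro sign, which latin1 cannot encode.
mixed = 'caf\u00e9 \u20ac5 that\\u2019s'
cleaned = unescape_unicode(mixed)
```

The regex variant does not combine surrogate pairs or interpret other escapes such as `\n`, so it is only a fit when the stray sequences are plain `\uXXXX` ones like those in the question.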
jsbueno
  • AttributeError: 'str' object has no attribute 'decode' – Jeanderson Candido Jan 04 '17 at 15:29
  • You are using Python 3, and the OP and this example are both in Python 2. (In Python 2, to start with, a `u" "` prefixed string is a unicode object, not a str.) Please, the voting system is not meant for personal vendettas; it is meant for marking incorrect answers. – jsbueno Jan 04 '17 at 15:31
  • I don't see in the question any reference for Python version – Jeanderson Candido Jan 04 '17 at 15:32
  • That is just because you aren't used to Python's different versions. There is the "print" statement instead of a function, among other clues. – jsbueno Jan 04 '17 at 15:34

The problem here is that the site ended up double-encoding those Unicode characters. Just do the following:

ls = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']

ls = map(lambda x: x.decode('unicode-escape'), ls)

Now you have a list of properly decoded Unicode strings:

for a in ls:
    print a
eLRuLL
  • I first tried your solution on my whole list and it didn't work. Then I copied your 4 code lines into a script and tried to run it, and it threw the following error: `UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 4: character maps to <undefined>` – mgruber Jan 04 '17 at 15:19
  • You should include your full example so that the question is easier to understand. That new error is happening because you have strings inside your list that don't have double backslashes, so they are already decoded. You'll have to remove the good ones first, or use a `try`/`except` block. – eLRuLL Jan 04 '17 at 15:26
  • This is more likely a problem when you try to _print_ the decoded string in a terminal which can't properly map this character. Check your error message for the line where the error occurs. This answer is correct. – jsbueno Jan 04 '17 at 15:26
  • If you are on Windows you simply won't be able to see the correct output for this on the CMD terminal, because it uses an encoding with only 256 characters that does not include the "\u2019" character. Try saving your results to a UTF-8 encoded file and opening that in an editor instead. – jsbueno Jan 04 '17 at 15:29
  • I used the exact 4 lines (including "ls") as posted above by @eLRuLL, and I only see strings with \\ in there?! This is what is printed before the UnicodeEncodeError: ` File "C:\Python27\lib\encodings\cp437.py", line 12, in encode return codecs.charmap_encode(input,errors,encoding_map)` – mgruber Jan 04 '17 at 15:31
  • Okay I understand the problem now ... Thanks a lot! Which answer is the faster / better alternative? @eLRuLL or @jsbueno? – mgruber Jan 04 '17 at 15:40
  • Both are equivalent. I find my way more readable, but it is mostly a matter of personal preference. My syntax is more specific to Python; `map` can be harder to read for people used to Python, but far easier for people who are not. That is, if people with other language backgrounds will be using your script, go for the `map` solution. – jsbueno Jan 04 '17 at 15:43
  • Okay, one last question: How do I output the result into a file so each char is readable? e.g. 'That´s' – mgruber Jan 04 '17 at 15:48
  • @mgruber you can try my answer to make the string human-readable – Jeanderson Candido Jan 04 '17 at 15:53
  • @mgruber you just need to encode it to `utf-8`. Check [this answer](http://stackoverflow.com/a/6048203/858913) – eLRuLL Jan 04 '17 at 15:55
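Saving the decoded strings to a UTF-8 file, as suggested in the comments above, might look like the sketch below. The sample `result` list and the temporary output path are assumptions for illustration; `io.open` is used because it takes an `encoding` argument on both Python 2 and Python 3:

```python
import io
import os
import tempfile

# Already-decoded strings, as produced by either answer.
result = [u'that\u2019s', u'it\u2019ll', u'\u2013']

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.mkdtemp(), 'result.txt')

# Write one item per line, encoded as UTF-8 on disk.
with io.open(out_path, 'w', encoding='utf-8') as handle:
    for item in result:
        handle.write(item + u'\n')

# Read it back with the same encoding to confirm the round trip.
with io.open(out_path, encoding='utf-8') as handle:
    lines = handle.read().splitlines()
```

Opening the file in an editor set to UTF-8 should then show the curly apostrophes directly, regardless of what the console is able to display.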