
I'm a little confused about how to unescape characters in Python. I am parsing some HTML using BeautifulSoup, and when I retrieve the text content it looks like this:

\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support

I'd like for it to look like this:

State-of-the-art security and 100% uptime SLA. Outstanding support

Here is my code below:

    self.__page = requests.get(url)
    self.__soup = BeautifulSoup(self.__page.content, "lxml")
    self.__page_cleaned = self.__removeTags(self.__page.content) #remove script and style tags
    self.__tree = html.fromstring(self.__page_cleaned) #contains the page html in a tree structure
    page_data = {}
    page_data["content"] =  self.__tree.text_content()

How do I remove those encoded backslashed characters? I've looked everywhere and nothing has worked for me.

Zerry Hogan

2 Answers


You can convert those escape sequences to proper text using the codecs module.

import codecs

s = r'\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'

# Convert the escape sequences
z = codecs.decode(s, 'unicode-escape')
print(z)
print('- ' * 20)

# Remove the extra whitespace
print(' '.join(z.split()))       

output

    [several blank lines here]
 



State-of-the-art security and 100% uptime SLA. 



Outstanding support
- - - - - - - - - - - - - - - - - - - - 
State-of-the-art security and 100% uptime SLA. Outstanding support

The codecs.decode(s, 'unicode-escape') function is quite versatile. It handles simple backslash escapes, such as the newline and carriage return sequences (\n and \r), but its main strength is decoding Unicode escape sequences like \u00a0, which is a non-breaking space. If your data contained other Unicode escapes, such as those for non-Latin letters or emoji, it would decode them too.
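As a small illustration of that point, the sketch below decodes a made-up string (the variable names and sample text are assumptions, not data from the question) containing a \u escape, a \U escape, and a plain \t. It also shows why the later `' '.join(z.split())` step removes the non-breaking space: Python 3 treats U+00A0 as whitespace.

```python
import codecs

# Hypothetical sample: a raw string, so the backslashes are literal characters.
escaped = r'caf\u00e9 \U0001F600 tab:\t.'

# Turn the escape sequences into the characters they stand for.
decoded = codecs.decode(escaped, 'unicode-escape')
print(decoded)

# U+00A0 (non-breaking space) counts as whitespace for str.split(),
# so split-and-join collapses it along with \n and \r.
nbsp_text = 'a\u00a0\r\n b'
collapsed = ' '.join(nbsp_text.split())
print(collapsed)
```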


As Evpok mentions in a comment, this won't work if the text string contains actual Unicode characters as well as Unicode \u or \U escape sequences.

From the codecs docs:

unicode_escape

Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.

Also see the docs for codecs.decode.
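
The limitation described above can be reproduced with a short sketch. The workaround shown is not part of the original answer: it assumes that first encoding to Latin-1 with the `backslashreplace` error handler, which rewrites every character outside Latin-1 as an escape sequence, is acceptable for your data. The sample string and variable names are made up for illustration.

```python
import codecs

# Hypothetical input: a real non-ASCII character (œ) mixed with backslash escapes.
raw = r"l\'œil\u00a0!"

# Direct decoding mangles œ, because its UTF-8 bytes are read as Latin-1.
mangled = codecs.decode(raw, 'unicode-escape')
print(repr(mangled))

# Possible workaround: make the bytes Latin-1 safe first. Characters that do not
# fit in Latin-1 become \uXXXX escapes, which unicode-escape then decodes back.
fixed = raw.encode('latin-1', 'backslashreplace').decode('unicode-escape')
print(repr(fixed))
```

This keeps characters that already fit in Latin-1 as single bytes, which is what the `unicode_escape` decoder expects, so they should survive the round trip as well.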

PM 2Ring
  • This fails with strings containing unicode characters since `unicode-escape` assumes latin1 input – Evpok Apr 03 '19 at 09:21
  • @Evpok Good point! I've updated my answer. I must admit that a text string containing a mix of Unicode chars & Unicode escape sequences would be rather strange, but I guess I've seen all sorts of strangely mangled Unicode. ;) At least Python 3 is a lot better in that regard than Python 2. – PM 2Ring Apr 03 '19 at 10:12
  • It actually bit me while trying to unescape `"l\'œil"`, which doesn't escape unicode, but still has escapes. – Evpok Apr 03 '19 at 11:35
  • @Evpok If that's a literal string in your Python script, then it doesn't need unescaping. OTOH, if that's data you've read in, so you actually have `r"l\'œil"`, which is equivalent to `"l\\'œil"`, then yes, `unicode-escape` decoding won't help. There are some suggestions at https://stackoverflow.com/q/1885181/4014959 but some of those answers only apply to Python 2. – PM 2Ring Apr 03 '19 at 11:57
  • Yes, thank you, the ast solution there was actually what I ended up using :-) – Evpok Apr 03 '19 at 12:33

You could use regular expressions:

import re

s = '\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'
s = ' '.join(re.findall(r"[\w%\-.']+", s))

print(s) #output: State-of-the-art security and 100% uptime SLA. Outstanding support

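re.findall("exp", s) returns a list of all non-overlapping substrings of s that match the pattern "exp". With the pattern r"[\w]+" it returns every run of letters, digits, or underscores; the non-breaking space "\u00a0" and the line breaks are not word characters, so they are skipped: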

['State', 'of', 'the', 'art', 'security', 'and', '100', 'uptime', 'SLA', 'Outstanding', 'support'] 

You can include characters by adding them to the expression like so:

re.findall(r"[\w%\-.']+", s)    # added "%", "-", "." and "'" ("-" is escaped with "\" so it is not read as a range)

' '.join(...) takes the list returned by re.findall and returns a single string with all of its elements separated by the string in the quotes (in this case a space).
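
The steps above can be put together as a short sketch. The variable names are chosen for illustration. The final `re.sub` line is not from this answer: it is an alternative that assumes you only want to collapse whitespace and keep every other character, rather than whitelist the allowed ones.

```python
import re

# Same sample as above: a normal (non-raw) string, so it holds real
# newlines, carriage returns and non-breaking spaces.
s = '\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'

# Word characters only: punctuation and "%" are dropped.
words_only = re.findall(r"\w+", s)

# Whitelist extra characters; "-" is escaped so it is not parsed as a range.
tokens = re.findall(r"[\w%\-.']+", s)
cleaned = ' '.join(tokens)
print(cleaned)

# Alternative (an assumption about what you want): collapse runs of
# whitespace, which for str patterns includes U+00A0, and trim the ends.
collapsed = re.sub(r"\s+", " ", s).strip()
print(collapsed)
```

The whitelist approach silently drops any character not listed in the class, so the whitespace-collapsing variant may be safer when the text can contain punctuation you have not anticipated.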

upe
  • Thanks man that worked, can you explain what's going on? Also, I didn't want to remove the slashes so I don't need that part. I will see what other solutions can be offered before I accept your answer – Zerry Hogan Nov 03 '17 at 22:09
  • `s = '\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'` isn't the data shown in the question. That's just normal text. You should be using a raw string to put those backslash sequences into your code as a literal string. – PM 2Ring Nov 03 '17 at 22:18