How to eliminate the ☎ unicode?

Question

While web scraping, after getting rid of all HTML tags, I was left with the black telephone character \u260e (☎) in my unicode output. Unlike in this response, I do want to get rid of it too.

I used the following regular expression in Scrapy to eliminate HTML tags:

pattern = re.compile("<.*?>|&nbsp;|&amp;",re.DOTALL|re.M)

Then I tried to match \u260e and I think I got caught by the backslash plague. I unsuccessfully tried these patterns:

pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\\\u260e",re.DOTALL|re.M)

None of these worked and I still have \u260e in the output. How can I make it disappear?

As mentioned on your link, raw strings are the antidote to backslash plague. It may not be the most relevant thing here, but keep it in mind. — Mike, May 06 '13 at 15:47
In line with @Rubens' answer, the problem you're facing is that regular strings *aren't* properly unicode encoded, unless you prefix them with `u`. — jpaugh, May 06 '13 at 15:56
＋1 Because this is the first time I've seen a ☎ in a URL — , Jul 10 '15 at 00:33

Rubens · Accepted Answer · 2013-05-06T15:43:33.023

7

Using Python 2.7.3, the following works fine for me:

import re

pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
s = u"bla ble \u260e blo"
re.sub(pattern, "", s)

Output:

u'bla ble  blo'

As pointed out by @Zack, this works because both strings are now unicode: inside a u"..." literal the escape \u260e is converted into the single character ☎ (code point U+260E), whereas in a plain byte string it stays as the six characters backslash, u, 2, 6, 0, e.

Once both the string to be searched and the regular expression contain the black phone itself, and not the sequence of characters \u260e, they match.
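
To make the difference concrete, here is a minimal sketch of the same idea. The sample input and the variable names (phone, escaped, cleaned, untouched) are made up for illustration; the u prefixes are required on Python 2 and harmless on Python 3.3+.

```python
import re

phone = u"\u260e"     # a single code point, U+260E (the black telephone)
escaped = u"\\u260e"  # six characters: backslash, u, 2, 6, 0, e

# Same pattern as above, with the real character as the last alternative.
pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e", re.DOTALL | re.M)

# Tags, &nbsp; and the telephone character are all removed.
cleaned = pattern.sub(u"", u"<b>call</b> \u260e&nbsp;555")

# Text that only contains the six-character escape is left alone,
# because the pattern looks for the character itself.
untouched = pattern.sub(u"", u"call " + escaped)

print(repr(cleaned))
print(untouched == u"call " + escaped)
```

If the output still shows \u260e after this, it is worth checking whether you are looking at the repr of a unicode string that really contains ☎, or at text that contains the literal backslash sequence.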

edited May 06 '13 at 15:43

answered May 06 '13 at 15:24

Rubens


Good answer but you should maybe emphasize that the key difference here is those `u` prefixes on all the strings, i.e. operating on Unicode rather than byte strings. – zwol May 06 '13 at 15:37
I guess that u prefix made some difference. It worked, thanks. – rafa May 06 '13 at 15:48

score 4 · Answer 2 · edited May 23 '17 at 12:02

If your string is already unicode, there are two easy ways. The second one will affect more than just the ☎, obviously.

>>> import string
>>> foo = u"Lorum ☎ Ipsum"
>>> foo.replace(u'☎', '')
u'Lorum  Ipsum'
>>> "".join(s for s in foo if s in string.printable)
u'Lorum  Ipsum'

See "Remove non-ascii characters but leave periods and spaces" for more information about string.printable, and "The SHORTEST way to remove multiple spaces in a string in Python" if you don't want multiple whitespaces.
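
As a rough illustration of how the two approaches differ, here is a small sketch with a made-up input that also contains an accented letter. The variable names are hypothetical, and the escape \u260e is used instead of the literal ☎ so the source file stays plain ASCII.

```python
import string

text = u"caf\u00e9 \u260e 555"  # "café ☎ 555"

# Way 1: remove only the telephone character.
only_phone_removed = text.replace(u"\u260e", u"")

# Way 2: keep only characters listed in string.printable (ASCII),
# which also drops the accented letter.
printable_only = u"".join(ch for ch in text if ch in string.printable)

# Either way leaves a double space behind; split/join collapses it.
collapsed = u" ".join(only_phone_removed.split())

print(repr(only_phone_removed))
print(repr(printable_only))
print(repr(collapsed))
```

So the string.printable filter is convenient when the output is meant to be ASCII-only anyway, but it is probably too aggressive if the scraped text contains names or words with accents.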

answered May 06 '13 at 15:27

timss

Writing that ☎ character directly on terminal it worked, but not on my pipeline. Replacing it by \u260e worked better. Thank you for those 2 additional hints :) – rafa May 06 '13 at 15:51

score 1 · Answer 3 · edited May 23 '17 at 10:24

You may try BeautifulSoup, as explained here, with something like

soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
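
One thing to keep in mind, assuming html is a UTF-8 encoded byte string: the 'ignore' error handler only discards bytes that are not valid UTF-8. A correctly encoded ☎ survives the decoding and still has to be removed afterwards. A minimal sketch of just the decoding step, with made-up sample bytes and without BeautifulSoup:

```python
# Hypothetical raw input: valid UTF-8 text followed by one invalid byte.
raw = u"call \u260e now".encode("utf-8") + b"\xff"

# 'ignore' silently drops the invalid byte but keeps the telephone character.
decoded = raw.decode("utf-8", "ignore")

# The character itself still needs an explicit removal step.
cleaned = decoded.replace(u"\u260e", u"")

print(repr(decoded))
print(repr(cleaned))
```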

answered May 06 '13 at 15:29

kiriloff
