
While web scraping, after stripping all HTML tags, I was left with the black telephone character \u260e (☎) in Unicode. Unlike in this response, I do want to get rid of it too.

I used the following regular expression in Scrapy to eliminate HTML tags:

pattern = re.compile("<.*?>|&nbsp;|&amp;",re.DOTALL|re.M)

Then I tried to match \u260e and I think I got caught by the backslash plague. I tried these patterns without success:

pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\\\u260e",re.DOTALL|re.M)

None of these worked and I still have \u260e in the output. How can I make it disappear?

rafa
  • As mentioned on your link, raw strings are the antidote to backslash plague. It may not be the most relevant thing here, but keep it in mind. – Mike May 06 '13 at 15:47
  • In line with @Rubens's answer, the problem you're facing is that regular strings *aren't* properly unicode encoded, unless you prefix them with `u`. – jpaugh May 06 '13 at 15:56
  • +1 Because this is the first time I've seen a ☎ in a URL –  Jul 10 '15 at 00:33

3 Answers


Using Python 2.7.3, the following works fine for me:

import re

pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
s = u"bla ble \u260e blo"
re.sub(pattern, "", s)

Output:

u'bla ble  blo'

As pointed out by @Zack, this works because the strings are now unicode: inside a `u"..."` literal the escape \u260e is converted to the single character ☎ (code point U+260E), whereas in a plain byte string it stays as the six characters \, u, 2, 6, 0, e.

Once both the string to be searched and the regular expression have the black phone itself, and not the sequence of characters \u260e, they both match.
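
For readers on Python 3, where every `str` literal is already Unicode and no `u` prefix is needed, the same idea might look like the sketch below. The names `CLEANUP` and `strip_markup` are made up for illustration and are not from the question.

```python
import re

# In Python 3, "\u260e" in a normal string literal is already the single
# character U+260E (BLACK TELEPHONE), so the pattern contains the phone itself.
CLEANUP = re.compile("<.*?>|&nbsp;|&amp;|\u260e", re.DOTALL)


def strip_markup(text):
    """Remove HTML tags, the two entities, and the phone character."""
    return CLEANUP.sub("", text)


cleaned = strip_markup("<b>Call</b>&nbsp;\u260e 555")
```

Here `cleaned` ends up as `'Call 555'`. The `re.M` flag from the question is left out because it only changes how `^` and `$` behave, and the pattern uses neither.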

Rubens
  • Good answer but you should maybe emphasize that the key difference here is those `u` prefixes on all the strings, i.e. operating on Unicode rather than byte strings. – zwol May 06 '13 at 15:37
  • I guess that u prefix made some difference. It worked, thanks. – rafa May 06 '13 at 15:48

If your string is already unicode, there are two easy ways. The second one removes every character that is not in string.printable (that is, all non-ASCII text), so it affects more than just the ☎.

>>> import string                                   
>>> foo = u"Lorum ☎ Ipsum"                          
>>> foo.replace(u'☎', '')                           
u'Lorum  Ipsum'                                     
>>> "".join(s for s in foo if s in string.printable)
u'Lorum  Ipsum'      
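
If filtering on `string.printable` is too aggressive (it also drops accented letters), one possible middle ground is to filter by Unicode category. This is a sketch in Python 3, not part of the original answer: `drop_symbols` is a hypothetical helper, and it assumes that removing everything in category `So` (Symbol, other), which is where ☎ lives, is acceptable for your data.

```python
import unicodedata


def drop_symbols(text):
    """Drop characters whose Unicode category is 'So' (Symbol, other)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")


# \u00e9 is "é" (a lowercase letter), \u260e is the black telephone (a symbol).
sample = "Caf\u00e9 \u260e Ipsum"
result = drop_symbols(sample)
```

With this input the accented letter is kept and only the phone is removed, leaving two spaces between the words just like the examples above.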
timss
  • Writing the ☎ character directly worked in the terminal, but not in my pipeline. Replacing it with \u260e worked better. Thank you for those 2 additional hints :) – rafa May 06 '13 at 15:51

You could try BeautifulSoup, as explained here, with something like:

soup = BeautifulSoup (html.decode('utf-8', 'ignore'))
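
The `decode('utf-8', 'ignore')` part is what turns the raw bytes into Unicode text before any parsing happens. A rough Python 3 sketch of just that step, using only the standard library and a made-up byte string instead of BeautifulSoup (so it does not strip the tags):

```python
# Hypothetical raw response body: the UTF-8 bytes for the phone character
# (e2 98 8e) plus one byte (ff) that is not valid UTF-8.
raw = b"<p>Call \xe2\x98\x8e now</p>\xff"

# errors="ignore" silently drops bytes that cannot be decoded.
text = raw.decode("utf-8", "ignore")

# After decoding, the phone is a single character and can be removed directly.
without_phone = text.replace("\u260e", "")
```

Note that `'ignore'` throws away undecodable bytes without any warning, so it may hide encoding problems in the scraped page.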
kiriloff