Parsing uncommon symbol using BeautifulSoup

Question

I have a link like this: <a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>. It contains an unusual symbol, ´, which is not present on a standard English keyboard; it looks like the mirror image of the backtick that Ctrl+K produces in this editor. I ran this code, found on Stack Overflow:

import BeautifulSoup  # BeautifulSoup 3, Python 2

soup = BeautifulSoup.BeautifulSoup("<a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>")
for a in soup.findAll('a'):
    print a['href']

The output is abc.asp?xyz=foobar&baz=lookatme, but I want abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg. The website I'm scraping is on a .br domain. Some of the text is in Portuguese, even though the links are in English, and that uncommon symbol may not be a valid character in English. Any thoughts or suggestions?

Edit: I looked at the representation Python gave me for the string; it was <a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>

One workaround is a custom regex; this snippet is also from Stack Overflow:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)

If it is impossible to modify BeautifulSoup's own regex, how can I modify the regex above to handle the \xb4 symbol? (Here s is the string in question.)

The regex comes from http://stackoverflow.com/questions/499345/regular-expression-to-extract-url-from-an-html-link, and the site I am trying to scrape is http://www.atlasdermatologico.com.br/listar.asp?acao=mostrar&arquivo=sweet%B4s_syndrome48.jpg (do not look at the other links on that page; they are graphic and intended for medical professionals only). I am not able to handle %B4 in my regex; in the Python string representation I see it escaped as \xb4. – motiur, Jul 23 '13 at 23:38

sgun · Answer 1 · 2013-07-24T07:26:22.343

0

You can include [\u0000-\uFFFF] as a range in the character class of the re pattern, or allow only \xb4 by writing it as [\u00b4].
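
As a minimal sketch of the character-class idea, assuming Python 3.3 or later (where the re module itself understands \u escapes): the sample string is the one from the question, and HREF_RE is a name introduced here purely for illustration. The acute accent is listed explicitly among the characters allowed in the URL.

```python
import re

# Sample markup from the question; \u00b4 is the same character as \xb4.
s = "<a href=abc.asp?xyz=foobar&baz=lookatme\u00b4_beautiful.jpg>"

# Hypothetical pattern: an explicit whitelist of URL characters plus \u00b4.
# The trailing "-" inside the class is a literal hyphen.
HREF_RE = re.compile(r"href=['\"]?([\w.?=&%/:\u00b4-]+)")

urls = HREF_RE.findall(s)
print(urls)
```

A whitelist like this is stricter than the negated class in the question, so any other unusual character in a link would need to be added to it as well.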

edited Jul 24 '13 at 07:26

answered Jul 23 '13 at 23:25


\ub4 should be \xb4 or \u00b4, right? \u expects 4 hex digits, like in your first example. – Fredrik Jul 24 '13 at 06:53

score 0 · Accepted Answer · answered Jul 23 '13 at 23:55

Upgrade to the latest version of BeautifulSoup and install html5lib, which is a very lenient parser:

import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.atlasdermatologico.com.br/listar.asp?acao=indice').text
soup = BeautifulSoup(html, 'html5lib')

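for a in soup.find_all('a'):
    href = a.get('href')

    # Under Python 2, repr() escapes non-ASCII characters such as \xb4,
    # so a backslash in the repr flags the hrefs that contain them.
    if '\\' in repr(href):
        print(repr(href))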

It correctly prints out the links with \xb4 in the URL.

Blender


Thanks, it works nicely, but there is one problem: how do I convince my browser that http://www.atlasdermatologico.com.br/listar.asp?acao=mostrar&arquivo=wells´_syndrome7.jpg is actually a link, i.e. how do I substitute ´ with %B4? Clicking that particular link in Chrome renders the page properly, but if I paste the link literally into the omnibox, it does not work. Is there a built-in function in Python that will help me? – motiur Jul 24 '13 at 07:48
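
One possible way to do the substitution asked about in the comment above, sketched for Python 3 under the assumption that the server expects Latin-1 (ISO-8859-1) bytes, as the %B4 in the question's own URL suggests. urllib.parse.quote accepts an encoding argument; the set of safe characters below is an assumption chosen for this particular URL.

```python
from urllib.parse import quote

# URL from the comment, with the literal acute accent (U+00B4).
url = ("http://www.atlasdermatologico.com.br/listar.asp"
       "?acao=mostrar&arquivo=wells\u00b4_syndrome7.jpg")

# Assumption: the site uses Latin-1, so U+00B4 is the single byte 0xB4.
# `safe` lists the URL delimiters that must not be percent-encoded.
encoded = quote(url, safe=":/?&=", encoding="latin-1")
print(encoded)
```

With the default UTF-8 encoding the same character would instead be encoded as two bytes (%C2%B4), which this server may not recognise.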

Fredrik · Answer 3 · 2013-07-24T00:05:05.677

Your regexp doesn't care what follows href=; it keeps matching until it reaches a space, a quote, or >, so it matches \xb4 just like any other character:

>>> import re
>>> s = "<a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>"
>>> print s.decode("latin-1")
<a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>
>>> urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
>>> urls
['abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg']

(btw, \xb4 is an acute accent)
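
For readers on Python 3, roughly the same session might look like the sketch below. It assumes the page arrives as Latin-1 encoded bytes (the names raw and text are introduced here for illustration); the regex is the one from the question, unchanged.

```python
import re

# The markup as raw bytes, as it might arrive from a Latin-1 encoded page.
raw = b"<a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>"

# Decode first so the regex operates on text rather than bytes.
text = raw.decode("latin-1")

# The negated class stops only at a quote, a space, or ">",
# so the acute accent is matched like any other character.
urls = re.findall(r'href=[\'"]?([^\'" >]+)', text)
print(urls)
```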
