I have a link like this: <a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>. It contains the unusual symbol ´, which is not even present on a standard English keyboard. It is the mirror image of the symbol that Ctrl+K produces in this editor. I ran this code, found on Stack Overflow:

import BeautifulSoup  # BeautifulSoup 3 (Python 2)

soup = BeautifulSoup.BeautifulSoup("<a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>")
for a in soup.findAll('a'):
    print a['href']

The output is abc.asp?xyz=foobar&baz=lookatme, but I want abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg. The website I'm scraping is on a .br domain. Some of the text is in Portuguese, even though the links are in English, so that uncommon symbol may not be a valid English-language character. Any thoughts or suggestions?

Edit: I looked at the repr of the Python string; it was <a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>

One workaround is to use a custom regex. This snippet is also from Stack Overflow:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)

If it is impossible to modify BeautifulSoup's regex, how can I modify the regex above to include the \xb4 symbol? (s here is the string in question.)

motiur
  • Can you post a link to the webpage? – Blender Jul 23 '13 at 23:24
  • http://stackoverflow.com/questions/499345/regular-expression-to-extract-url-from-an-html-link -- this is the Stack Overflow question, and this is the website I am trying to scrape: http://www.atlasdermatologico.com.br/listar.asp?acao=mostrar&arquivo=sweet%B4s_syndrome48.jpg -- do not look at the other links on that page; the content is graphic and meant for medical professionals only. I am not able to incorporate %B4s in my regex; I saw it escaped as \xb4 in the repr of my Python string. – motiur Jul 23 '13 at 23:38

3 Answers

0

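You can include [\u0000-\uFFFF] as a range in the re pattern, or match only \xb4 with [\u00b4]. In Python 2 the pattern has to be a unicode string (for example u'[\u00b4]') for the \u escape to be interpreted.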
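
A minimal sketch of that idea, assuming Python 3 (where every str is Unicode, so \u00b4 in an ordinary string literal is the ´ character). The whitelist of URL characters in the class is an illustrative assumption, not something taken from the question:

```python
import re

# The question's sample tag; '\u00b4' is the acute accent (´).
s = '<a href=abc.asp?xyz=foobar&baz=lookatme\u00b4_beautiful.jpg>'

# Hypothetical whitelist of characters allowed in the href value,
# with \u00b4 added explicitly to the character class.
href_re = re.compile('href=[\'"]?([\\w.?=&%/:\u00b4-]+)')

urls = href_re.findall(s)
print(urls)
```

A whitelist like this is stricter than the question's negated class [^'" >], so any other unexpected character in a URL would again cut the match short.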

sgun
0

Upgrade to the latest version of BeautifulSoup and install html5lib, which is a very lenient parser:

import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.atlasdermatologico.com.br/listar.asp?acao=indice').text
soup = BeautifulSoup(html, 'html5lib')

for a in soup.find_all('a'):
    href = a.get('href')

    if '\\' in repr(href):
        print(repr(href))

It correctly prints out the links with \xb4 in the URL.

Blender
  • Thanks, it works nicely, but there is a problem: how do I convince my browser that http://www.atlasdermatologico.com.br/listar.asp?acao=mostrar&arquivo=wells´_syndrome7.jpg is actually a link, i.e. how do I substitute ´ with %B4? Clicking that particular link in Chrome renders the page properly, but if I paste the link into the omnibox it does not work. Is there any built-in function in Python that will help me? – motiur Jul 24 '13 at 07:48
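
On the follow-up about turning ´ into %B4: one possible approach, sketched here for Python 3, is urllib.parse.quote with a Latin-1 encoding. That the server expects Latin-1 is an assumption, suggested by the %B4 in the working link from the question's comments; the set of characters passed as `safe` is also a choice made for this example.

```python
from urllib.parse import quote

# URL as it appears in the page source, with a literal acute accent (U+00B4).
url = ('http://www.atlasdermatologico.com.br/listar.asp'
       '?acao=mostrar&arquivo=wells\u00b4_syndrome7.jpg')

# Percent-encode non-ASCII characters as Latin-1 bytes (assumed encoding).
# `safe` lists the URL delimiters that must be left unescaped.
encoded = quote(url, safe=':/?&=', encoding='latin-1')
print(encoded)
```

With the default UTF-8 encoding the same call would produce %C2%B4 instead of %B4, which is why the encoding is passed explicitly.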
0

Your regexp doesn't care what follows href=; it captures everything up to the next space, quote, or >, so it matches \xb4 just like any other character:

>>> import re
>>> s = "<a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>"
>>> print s.decode("latin-1")
<a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>
>>> urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
>>> urls
['abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg']

(btw, \xb4 is the acute accent in Latin-1, U+00B4)
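
The session above is Python 2. A rough Python 3 equivalent might look like the sketch below; it assumes the page arrives as Latin-1 encoded bytes, which is a guess based on the \xb4 byte in the question:

```python
import re

# Raw bytes as they might arrive from the server (assumed Latin-1).
raw = b"<a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>"

# Decode first so that byte 0xB4 becomes the character U+00B4 (´).
s = raw.decode("latin-1")

# Same pattern as in the question: capture everything after href=
# up to the next quote, space, or '>'.
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(urls)
```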

Fredrik