I have a link like this: <a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>, which contains the unusual symbol ´. That symbol is not even present on a standard English keyboard; it is the mirror image of the symbol that Ctrl+K produces in this editor.
I ran this code, found on Stack Overflow:
import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup("<a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>")
for a in soup.findAll('a'):
    print a['href']
The output is abc.asp?xyz=foobar&baz=lookatme, but I want abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg.
The website that I'm scraping is in a .br domain. Some of the text is in Portuguese, even though the links are in English, and that uncommon symbol may not be a valid English-language symbol. Any thoughts or suggestions?
Edit: I looked at the representation that Python produced for the string; it was <a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>
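To identify the character behind the \xb4 escape, a quick check along these lines may help. This is only a minimal sketch: it assumes Python 3, and it assumes the page bytes are Latin-1 (ISO-8859-1) encoded, which I have not confirmed for this site. The variable names are just for illustration.

```python
import unicodedata

# Assumption: the raw page bytes are Latin-1 encoded (not confirmed for this site).
raw = b"<a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>"
text = raw.decode("latin-1")

# Collect the non-ASCII characters and look up the Unicode name of the first one.
odd_chars = [ch for ch in text if ord(ch) > 127]
symbol = odd_chars[0]
symbol_name = unicodedata.name(symbol)
print(hex(ord(symbol)), symbol_name)
```

Under that assumption, the symbol is code point U+00B4, which Unicode names ACUTE ACCENT.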
One workaround is to use a custom regex; this snippet is also from Stack Overflow:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
If it is impossible to modify BeautifulSoup's regex, how can I modify the above regex to incorporate the \xb4 symbol? (s here is the string in question.)
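For reference, here is a minimal sketch of how the regex above behaves on the string, assuming Python 3 and assuming s holds the tag as a decoded str (the sample value of s is hypothetical, copied from the repr in the edit). The character class [^\'" >] only excludes quotes, spaces and >, so as far as I can tell \xb4 should already fall inside the match without any change to the pattern.

```python
import re

# Hypothetical sample input: the tag from the question as a Python 3 str.
s = "<a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>"

# Same pattern as above: an optional opening quote, then one or more
# characters that are not a quote, a space, or '>'.
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(urls)
```

If this holds for the real page, the truncation would come from the parser rather than from the regex approach.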