How to use extended ascii with bs4 url

Question

I've been reluctant to post a question about this, but after 3 days of Googling I can't get this to work. Long story short, I'm making a raid gear tracker for WoW.

I'm using BS4 to handle the web scraping, and I'm able to pull the page and scrape the info I need from it. The problem I'm having is when there is an extended ASCII character in the player's name, e.g. thermíte (the í is Alt+161).

http://us.battle.net/wow/en/character/garrosh/thermíte/advanced

I'm trying to figure out how to re-encode the url so it is more like this:

http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced

I'm using tkinter for the GUI. The user selects their realm from a dropdown and then types the character name into an entry field.

namefield = Entry(window, textvariable=toonname)

I have a scraping function that performs the initial scrape of the main profile page. This is where I assign the value of namefield to a global variable. (I also tried passing it directly to the scraper with this:)

namefield = Entry(window, textvariable=toonname, command=firstscrape)

I thought I was close, because when I passed it "thermíte", the scrape function would print out "therm\xC3\xADte". All I needed to do was replace the '\x' with '%' and I'd be golden. But it wouldn't work. I could use mastername.find('\x') and it would find instances of it in the string, but mastername.replace('\x','%') wouldn't actually replace anything.

I tried various combinations of r'\x', '\%', and so on. No dice.

Lastly, when I try things like encoding to Latin-1 and then decoding back into UTF-8, I get errors saying it can't handle the extended ASCII character.

urlpart1 = "http://us.battle.net/wow/en/character/garrosh/"
urlpart2 = mastername
urlpart3 = "/advanced"
url = urlpart1 + urlpart2 + urlpart3

That's what I've been using to try to rebuild the final URL (for now I'm leaving the realm constant until I get the name problem fixed).

TL;DR:

I'm trying to take a url with extended ascii like:

http://us.battle.net/wow/en/character/garrosh/thermíte/advanced

And have it become a url that a browser can easily process like:

http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced

with all of the normal extended ascii characters.

I hope this made sense.

Here is a pastebin of the full script as it stands. Some things in it aren't used until later on. pastebin link

jfs · Accepted Answer · 2015-10-27T21:54:07.777

1

The resulting URL shouldn't contain any non-ASCII characters. Make sure mastername is a Unicode string (isinstance(mastername, str) on Python 3) and percent-encode it with quote():

#!/usr/bin/env python3
from urllib.parse import quote

mastername = "thermíte"
assert isinstance(mastername, str)
url = "http://us.battle.net/wow/en/character/garrosh/{mastername}/advanced"\
        .format(mastername=quote(mastername, safe=''))
# -> http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
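
For the tracker, where the realm comes from a dropdown and the name from an entry field, the same call could be wrapped in a small helper. This is only a sketch: build_profile_url and BASE_URL are hypothetical names that do not appear in the original script, and it assumes both values arrive as ordinary Python 3 str objects.

```python
from urllib.parse import quote

# Hypothetical constant: the fixed part of the profile URL from the question.
BASE_URL = "http://us.battle.net/wow/en/character"


def build_profile_url(realm, name):
    """Percent-encode realm and name, then assemble the profile URL.

    safe="" makes quote() escape "/" as well, so a stray slash in the
    user's input cannot change the path structure.
    """
    return "{}/{}/{}/advanced".format(
        BASE_URL, quote(realm, safe=""), quote(name, safe="")
    )


print(build_profile_url("garrosh", "thermíte"))
# -> http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
```

Quoting each path segment separately, rather than the whole URL, keeps the scheme and the slashes untouched while still escaping anything unusual the user types.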

edited Oct 27 '15 at 21:54

answered Oct 27 '15 at 05:44


Thanks. I got it working now. I had to change it a little bit `.format(mastername=urllib.parse.quote(mastername, safe=''))` – thermite Oct 27 '15 at 21:47
@thermite: the code works as is. You use a different import. – jfs Oct 27 '15 at 21:54
So if I have import urllib, it wouldn't work? Shouldn't that do the same thing as from urllib import *? I would think that would import quote from parse as well. – thermite Oct 28 '15 at 18:14
@thermite: no, it shouldn't. See ['import module' or 'from module import'](http://stackoverflow.com/q/710551/4279) – jfs Oct 28 '15 at 18:23
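
To illustrate the import discussion above, here is a small Python 3 sketch of the two spellings that reach the same function. On Python 3, quote lives in the urllib.parse submodule, and a bare import urllib is not guaranteed to load that submodule, so it is imported explicitly here.

```python
# Form 1: import the submodule and use the fully qualified name.
import urllib.parse

# Form 2: import the function itself into the current namespace.
from urllib.parse import quote

name = "thermíte"

# Both names refer to the same function object.
assert quote is urllib.parse.quote

print(urllib.parse.quote(name, safe=""))  # -> therm%C3%ADte
print(quote(name, safe=""))               # -> therm%C3%ADte
```

Which form to use is a matter of style; the only requirement is that the name used in the call matches the name that was actually imported.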

Gil · Answer 2 · 2015-10-27T03:12:19.523

0

You can try something like this:

>>> import urllib  # Python 2; on Python 3 use urllib.parse.quote
>>> url = 'http://us.battle.net/wow/en/character/garrosh/thermíte/advanced'
>>> 'http://' + '/'.join([urllib.quote(x) for x in url[len('http://'):].split('/')])
'http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced'

urllib.quote() percent-encodes the unsafe characters in a string. You don't want every character to be affected, only the parts between the '/' characters, excluding the initial 'http://'. So the 'http://' prefix is removed and the rest is split on '/'; each piece is quoted, then everything is put back together with join and the + operator.

EDIT: This one is on me for not reading the docs... Much cleaner:

>>> url = 'http://us.battle.net/wow/en/character/garrosh/thermíte/advanced'
>>> urllib.quote(url, safe=':/')
'http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced'
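
For reference, a rough Python 3 equivalent of the safe=':/' approach, using urllib.parse.quote instead of urllib.quote. The variable names are illustrative only. The sketch also shows one thing to watch for: quoting a URL that is already percent-encoded escapes the '%' signs a second time.

```python
from urllib.parse import quote

raw = "http://us.battle.net/wow/en/character/garrosh/thermíte/advanced"

# Keep ":" and "/" as-is so the scheme and path separators survive.
encoded = quote(raw, safe=":/")
print(encoded)
# -> http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced

# Quoting the already-encoded URL again turns each "%" into "%25".
double_encoded = quote(encoded, safe=":/")
print(double_encoded)
# -> http://us.battle.net/wow/en/character/garrosh/therm%25C3%25ADte/advanced
```

So quote() should be applied once, to the raw name or raw URL, not to a string that has already been encoded.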

edited Oct 27 '15 at 03:12

answered Oct 27 '15 at 02:55


But would I be able to use urllib.quote() on a variable that is created somewhere else? The user is the one entering the name, which may contain odd characters. I then need to build it into a safe URL. – thermite Oct 27 '15 at 03:05
Sure, as long as the user is passing you a string, you can run urllib.quote() on it. The documentation (linked above) also describes the "safe" characters you can pass, so you can have exclusions if you want. – Gil Oct 27 '15 at 03:08
@thermite check out the edit... a much cleaner way, using the documented safe parameter. Sorry for not reading the docs more carefully the first time around! – Gil Oct 27 '15 at 03:13
Thanks, I just gave it a shot and it's not having it. I keep getting AttributeError: 'module' object has no attribute 'quote'. It's calling out urlpartx = urllib.quote(urlpart2). I put a pastebin link in the OP if you want to take a look at the full script; it might make it a bit easier to see what I'm trying to do. And I appreciate the help. I kind of dove into a project that's a little above my skill level. It's how I learn best xD – thermite Oct 27 '15 at 03:17
@thermite if you're using Python 3, the library has changed. You now need urllib.parse.quote: https://docs.python.org/3.1/library/urllib.parse.html – Gil Oct 27 '15 at 03:19
