UTF8 Character in URL String

Question

I wrote a small Python script that parses a website. In the URL of one link, the character "ä" comes out as \u00e4, e.g. http://foo.com/h\u00e4ppo, but I need http://foo.com/häppo.

`\u00e4` is just a way to represent the character `ä`, i.e. `'ä' == '\u00e4'`. What are you doing with the character after receiving it, and where does that fail? – zvone, Sep 29 '16 at 19:58
I parse the JSON in my Xcode project (Swift 3), and in the `if let` pattern the cast to URL fails because of the \u00e4 in the string. – Fab ian, Sep 29 '16 at 20:21
Please reduce your original program to the shortest **complete** program that demonstrates the error. Please [edit] your question and copy-paste that short complete program into your question. Please include both the expected and actual output, and the full text of any error message. See [mcve] and [ask] for more info. – Robᵩ, Sep 29 '16 at 20:32

score 0 · Answer 1 · edited May 23 '17 at 12:00

The character \u00e4 which you have is already correct. That is in fact ä.

Sometimes the representation (repr) of a string displays it in escaped form, just as a backslash \ is displayed escaped as \\. That part is fine.
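To make this concrete, here is a minimal Python 3 sketch showing that the escape and the typed character are the same one-character string. It also shows one likely source of the \u00e4 form: `json.dumps` escapes non-ASCII characters by default (`ensure_ascii=True`). Whether the script in the question writes its JSON this way is an assumption; the URL is the one from the question.

```python
import json

escaped = '\u00e4'   # escape sequence written in source code
literal = 'ä'        # the same character typed directly

same = escaped == literal   # both are the one-character string 'ä'
length = len(escaped)       # the escape is one character, not six

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True) ...
dumped = json.dumps({'url': 'http://foo.com/häppo'})
# ... and json.loads turns the escape back into the real character.
loaded = json.loads(dumped)['url']

print(same, length)
print(dumped)
print(loaded)
```

Any conforming JSON parser should decode the escape the same way, so the \u00e4 in the JSON text is not itself an error.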

The Actual Problem

The actual problem is that you cannot use ä in a URL. Only a small subset of ASCII characters is valid in URLs (see Which characters make a URL invalid?).

So, you have to escape parts of your URL.

>>> import urllib.parse
>>> urllib.parse.quote('ä')
'%C3%A4'

>>> urllib.parse.quote('\u00e4')  # same thing
'%C3%A4'

But be careful not to escape the whole URL, only the parts that carry data (such as a path segment or a query value); otherwise the delimiters get escaped too. For example, this is wrong:

>>> urllib.parse.quote('https://www.google.com/?q=\u00e4')
'https%3A//www.google.com/%3Fq%3D%C3%A4'

You'll want to do:

>>> 'https://www.google.com/?q=' + urllib.parse.quote('\u00e4')
'https://www.google.com/?q=%C3%A4'

Try it and see what happens: https://www.google.com/?q=%C3%A4
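For a complete URL like the one in the question, one possible approach is to split it first and quote only the path and query. The sketch below is Python 3; `encode_url` is a hypothetical helper name, not a library function, and the choice of `safe` characters is an assumption. It keeps `%` safe so already-encoded URLs are not encoded twice, which also means a literal `%` in the input is left alone. Non-ASCII host names are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode the path and query of a URL, keeping its structure.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Keep '/' so path segments stay separated; keep '%' to avoid double encoding.
    path = quote(parts.path, safe='/%')
    # Keep '=' and '&' so key=value pairs in the query stay intact.
    query = quote(parts.query, safe='=&%')
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(encode_url('http://foo.com/h\u00e4ppo'))
print(encode_url('https://www.google.com/?q=\u00e4'))
```

This should print http://foo.com/h%C3%A4ppo and https://www.google.com/?q=%C3%A4, leaving the scheme, host, and delimiters untouched.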


answered Sep 29 '16 at 20:32

zvone

I have tested it in my Xcode project. If I use "\u00e4".addingPercentEncoding(withAllowedCharacters: CharacterSet.urlQueryAllowed), it converts the \u00e4 to %C3%A4, but I want to do the encoding directly in my Python script. So I tested urllib.parse.quote('\u00e4'), but I get AttributeError: 'module' object has no attribute 'parse' – Fab ian Sep 29 '16 at 20:41
@Fabian I don't know what Xcode is or what you mean by "directly in my python script" – zvone Sep 29 '16 at 20:44
Xcode is a development environment for iOS apps by Apple. The Python script is a little script on my server that parses a website and creates a JSON file with the content of the website. – Fab ian Sep 29 '16 at 20:57
So now I tested both calls from the answer, urllib.parse.quote('ä') and urllib.parse.quote('\u00e4'). But with urllib.parse.quote('\u00e4') I get '5Cu00e4' instead of '%C3%A4'. Any idea why? – Fab ian Sep 29 '16 at 21:07
@Fabian Which Python version? It should be `'\u00e4'` in Python 3 and `u'\u00e4'` in Python 2. See http://stackoverflow.com/q/18034272/389289 – zvone Sep 29 '16 at 22:39
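
The symptoms in these comments are consistent with running Python 2, where `urllib.parse` does not exist and `'\u00e4'` without the `u` prefix is six literal characters rather than an escape. That diagnosis is an assumption, since the thread never confirms the version. The Python 3 sketch below reproduces the second symptom with a raw string, which behaves like the unprefixed Python 2 literal:

```python
from urllib.parse import quote

real_char = '\u00e4'       # one character: ä
literal_text = r'\u00e4'   # six characters: backslash, u, 0, 0, e, 4

# Quoting the real character gives its UTF-8 bytes, percent-encoded.
print(len(real_char), quote(real_char))

# Quoting the six-character text only has the backslash to escape (%5C).
print(len(literal_text), quote(literal_text))
```

The first line of output should be `1 %C3%A4` and the second `6 %5Cu00e4`, which resembles the result reported above.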

score 0 · Answer 2 · answered Sep 29 '16 at 20:38

Unfortunately, this depends heavily on the encoding of the site you parsed, as well as on your local I/O encoding.

I'm not sure whether you can translate it after parsing, or whether it's worth the work. If you have the chance to parse it again, you can try Python's decode() method, like:

text.decode('utf8')  # text must be a byte string (bytes in Python 3)

Besides that, check that the encoding used above is the same as the one in your local environment. This is especially important on Windows, which often uses cp1252 as its default encoding.
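
As an illustration of why the codec has to match, here is a small Python 3 sketch that decodes the same bytes once as UTF-8 and once as cp1252. The byte string is built locally for the example; that the site in the question actually serves UTF-8 is an assumption.

```python
# Bytes as they might arrive from a UTF-8 encoded page (built here for the example).
raw = 'h\u00e4ppo'.encode('utf-8')

right = raw.decode('utf-8')    # matching codec: the original text comes back
wrong = raw.decode('cp1252')   # mismatched codec: each byte becomes its own character

print(raw)
print(right)
print(wrong)
```

With the wrong codec, the two bytes of ä are read as two separate characters, which is the typical "Ã¤" mojibake.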

On Mac and Linux:

export PYTHONIOENCODING=utf8

On Windows:

set PYTHONIOENCODING=utf8

It's not much, but I hope it helps.