
I have a variable such as title:

title = "révolution_essentielle"

I could encode and decode it like this for other purposes:

title1 = unicode(title, encoding = "utf-8")

But how do I preserve the non-English characters and use the title as part of a URL string to access the URL? For instance, I ideally want https://mainurl.com/révolution_essentielle.html by concatenating several strings including title like this:

url = main_url + "/" + title + ".html"

Could anyone kindly show me how to do that? Thanks a bunch!

Ashley Mills
shenglih
  • What error are you getting now? – Zesima29 Jul 07 '18 at 17:50
  • @Zesima29, it shows up as "r\xc3\xa9volution" in the url – shenglih Jul 07 '18 at 18:03
  • Are you looking for `urllib.parse.quote("révolution_essentielle")` (resp. `urllib.quote("révolution_essentielle")` for Py2)? – Ondrej K. Jul 07 '18 at 19:56
  • This isn't Spanish, it's French! – lenz Jul 07 '18 at 20:08
  • @OndrejK. Thanks but that returns a KeyError – shenglih Jul 07 '18 at 22:39
  • @shenglih That very line I've pasted? Or something else? Py3:`import urllib.parse; urllib.parse.quote("révolution_essentielle")` -> `'r%C3%A9volution_essentielle'`; Py2: `import urllib ; urllib.quote("révolution_essentielle")` -> `'r%C3%A9volution_essentielle'` – Ondrej K. Jul 07 '18 at 22:47
  • @OndrejK. Oh sorry I tested your line on another Non-English instance ``u'hey_there_who_likes_lego_that\xe3\u0192\xe2_\xe3_\xe2_\u0161\xe2_\xe3_\xe2_\u017e\xe2_s_all_that_needs_to_be_said_it\xe3\u0192\xe2_\xe3_\xe2_\u0161\xe2_\xe3_\xe2_\u017e\xe2_s_a_vector_with_a_million_uses_download_for_free_the_entire_alphabet_made_from_vector_lego'``... Sorry yours did work for the one I gave in the question – shenglih Jul 07 '18 at 22:51
  • @shenglih OK, got two pieces of information: we're talking Python 2. And I really hope this produces valid URL material... encode your Unicode string first: `import urllib ; urllib.quote(u'hey_there_who_likes_lego_that\xe3\u0192\xe2_\xe3_...'.encode('utf8'))` – Ondrej K. Jul 07 '18 at 22:55
  • @OndrejK. That's perfect!! Thanks so much! – shenglih Jul 07 '18 at 23:51
  • @lenz you are absolutely right! lol Thanks a lot! Quick question: the codes that work for any non-English language should be readily applicable to other non-English languages, correct? – shenglih Jul 07 '18 at 23:53
  • Your problem really doesn't have anything to do with languages, it's just about encoding strings. The suggested method should work for any encoded string, I guess, no matter what language is used to produce the text in question. – lenz Jul 08 '18 at 08:47
  • @lenz true true, thanks lenz! – shenglih Jul 10 '18 at 17:32

1 Answer


To summarize what we've talked about in the comments: there is a function for quoting URLs (replacing special characters with %-prefixed escape sequences).

For Python 2 (as used in this case), it's urllib.quote(), which can be used as follows:

urllib.quote("révolution_essentielle")

When our input is a unicode object containing non-ASCII characters, we also need to encode it first, e.g.:

urllib.quote(u'hey_there_who_likes_lego_that\xe3\u0192\xe2_\xe3_...'.encode('utf8'))

Beware, though: make sure the representation you produce matches the one expected/understood by the machine on the other end.
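To illustrate why the choice of encoding matters, here is a small sketch. The sample string is shortened from the question, and the import fallback is my own addition so that the snippet should run on both Python 2 and Python 3. The same text percent-encodes differently depending on which encoding is applied before quoting:

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

# u"révolution", written with an escape so the source stays ASCII-only
text = u"r\xe9volution"

# Encode to bytes first, then percent-encode those bytes
as_utf8 = quote(text.encode("utf-8"))
as_latin1 = quote(text.encode("latin-1"))

print(as_utf8)    # r%C3%A9volution
print(as_latin1)  # r%E9volution
```

UTF-8 is the usual expectation for URLs nowadays, but that is an assumption worth checking against the site you are actually calling.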


If we were talking Python 3, the equivalent function would be urllib.parse.quote():

urllib.parse.quote("révolution_essentielle")

It accepts str (Unicode) arguments as well as already-encoded values in a bytes object.
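Putting it together for the URL from the question, a minimal Python 3 sketch could look like this (main_url is just the placeholder host used in the question; only the title part is quoted, so the scheme and slashes stay intact):

```python
from urllib.parse import quote

main_url = "https://mainurl.com"  # placeholder host taken from the question
title = "révolution_essentielle"

# Quote only the path segment that may contain non-ASCII characters
url = main_url + "/" + quote(title) + ".html"
print(url)  # https://mainurl.com/r%C3%A9volution_essentielle.html

# Passing UTF-8 encoded bytes yields the same escaped text
same = quote(title.encode("utf-8")) == quote(title)
```

A browser will typically display such a URL with the accented character, while the escaped form is what travels over the wire.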

Ondrej K.