0

parse_qsl() does not handle ; the way I need it to. Is there a way to make it aware of ;? Thanks.

>>> import urllib.parse
>>> urllib.parse.parse_qsl('http://example.com/?q=abc&p=1;2;3')
[('http://example.com/?q', 'abc'), ('p', '1')]
wim
user1424739
  • 1
    Basically, you (or in this case `urllib.parse.parse_qsl()`) are supposed to treat ";" like "&" in a URL. So `urllib` sees your URL the same way as it would see `http://example.com/?q=abc&p=1&2&3`. If you can, you should encode the semicolons in the URL like this: `http://example.com/?q=abc&p=1%3B2%3B3` or separate the numbers with commas instead of semicolons. If you don't control the URLs you might have to parse the querystring yourself. https://stackoverflow.com/a/1178285/495319 – Wodin Nov 09 '19 at 22:04
  • 1
    Could you post the complete solution as an answer? Thanks. – user1424739 Nov 09 '19 at 22:54

2 Answers

2

It would be best to make sure that the semicolons in the URLs you are dealing with are URL-encoded, e.g. http://example.com/?q=abc&p=1%3B2%3B3

If for some reason you can't do the above, you could do something like this:

from urllib.parse import urlparse, unquote_plus

url = "http://example.com/?q=abc&p=1;2;3"
parts = urlparse(url)
qs = parts.query  # "q=abc&p=1;2;3"
# Split on "&" only, so that ";" stays inside the value.
pairs = [p.split("=", 1) for p in qs.split("&")]
# Percent-decode keys and values ("+" becomes a space).
decoded = [(unquote_plus(k), unquote_plus(v)) for (k, v) in pairs]
print(decoded)
# [('q', 'abc'), ('p', '1;2;3')]

The above code assumes a few things about the query string, e.g. that every key has a value (a field without "=" makes the tuple unpacking fail). If you want something that makes fewer assumptions, see the parse_qsl source code.
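As a rough sketch of a version that makes fewer assumptions, the split can be done with str.partition so that a key without "=" gets an empty value, and empty segments are skipped. The helper name parse_qsl_amp_only and the extra "flag" field in the example are made up for illustration; this is not what parse_qsl itself does internally.

```python
from urllib.parse import urlsplit, unquote_plus


def parse_qsl_amp_only(url):
    """Hypothetical helper: split the query on "&" only, keeping ";" in values."""
    query = urlsplit(url).query
    result = []
    for field in query.split("&"):
        if not field:
            continue  # skip empty segments such as the one in "a=1&&b=2"
        # partition() never raises: value is "" when the field has no "="
        key, _, value = field.partition("=")
        result.append((unquote_plus(key), unquote_plus(value)))
    return result


print(parse_qsl_amp_only("http://example.com/?q=abc&p=1;2;3&flag"))
# [('q', 'abc'), ('p', '1;2;3'), ('flag', '')]
```

Unlike the list comprehension above, this keeps valueless keys as (key, '') pairs, which is roughly what keep_blank_values=True does in parse_qsl.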

Wodin
0

Actually, it does treat them correctly (as delimiters). You just have to tell it to keep blank values:

>>> urllib.parse.parse_qsl('q=abc&p=1;2;3', keep_blank_values=True)
[('q', 'abc'), ('p', '1'), ('2', ''), ('3', '')]

Note that you should not pass the entire URL to parse_qsl, only the querystring part.
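A small sketch of passing only the querystring, using urlsplit to extract it first. One caveat: the output shown above reflects the interpreters available when this was written. As far as I know, Python 3.10 (and the security releases of the older branches from around the same time, such as 3.9.2) added a separator parameter to parse_qsl that defaults to "&" and stopped treating ";" as a delimiter, so the sketch below assumes such a version and will raise a TypeError on older ones.

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://example.com/?q=abc&p=1;2;3"
query = urlsplit(url).query  # "q=abc&p=1;2;3"

# separator="&" is already the default on interpreters that have the
# parameter; it is spelled out here only to make the intent explicit.
pairs = parse_qsl(query, keep_blank_values=True, separator="&")
print(pairs)
# [('q', 'abc'), ('p', '1;2;3')]
```

On those newer versions the semicolons stay inside the value of p without any hand-written parsing.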

wim
  • I think for this case, 1;2;3 are all the value of p. – user1424739 Nov 11 '19 at 17:56
  • No, that's not correct. For that, the querystring should look like `q=abc&p=1%3B2%3B3`. If you receive an unencoded querystring, then you have a problem elsewhere and need to address it there (otherwise you will eventually get double-decoding bugs). – wim Nov 11 '19 at 18:13
  • 2
    The semicolon is one of the RFC 1738 reserved characters (';', '/', '?', ':', '@', '=' and '&'), so it needs to be URL-encoded in the querystring. – wim Nov 11 '19 at 18:21
  • I am not talking about the RFC. I am talking about the specific example. Obviously the website from which my example is derived does not follow this RFC. – user1424739 Nov 11 '19 at 20:53
  • 1
    The querystring is data received from the *client*. When a client sends a malformed request, you should send back a 400 response. Do not work around it and try to parse corrupt data. – wim Nov 11 '19 at 20:58