Split the title part of the URL into a separate column - Python

Question

Suppose I have a URL as follows:

http://sitename.com/pathname?title=moviename&url=VIDEO_URL

I want to parse this URL to get the title part and the url part separately.

I tried the following,

from urlparse import urlparse
q = urlparse('http://sitename.com/pathname?title=moviename&url=VIDEO_URL')

After I do this, I get the following result,

q
ParseResult(scheme='http', netloc='sitename.com', path='/pathname', params='', query='title=moviename&url=VIDEO_URL', fragment='')

and q.query has,

'title=moviename&url=VIDEO_URL'

I am not able to use q.query.title or q.query.url here. Is there a way I can access this? I would like to split the url and title part separately into separate columns. Can we do it this way or can we write a substring method which would check for starting with "title" and ending with "&" and split it?

Thanks

Aaron Christiansen · Accepted Answer · 2016-03-17T18:32:04.877

7

You can use urlparse.parse_qs here to make a dictionary of parameters.

from urlparse import urlparse, parse_qs
q = urlparse('http://sitename.com/pathname?title=moviename&url=VIDEO_URL')
qs = parse_qs(q.query)
print qs["title"][0]  # moviename (parse_qs returns a list for each key)
print qs["url"][0]  # VIDEO_URL

This is the most reliable way to parse a URL's parameters: much better than split.
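
For readers on Python 3, where these functions live in urllib.parse rather than urlparse, a minimal sketch of the same approach applied to several links might look like the following. The helper name title_and_url and the second sample link are made up for illustration, and reporting a missing parameter as None is an assumption, not something the question specifies.

```python
from urllib.parse import urlparse, parse_qs  # Python 3 location of these functions


def title_and_url(link):
    """Return (title, url) taken from the query string of link.

    Hypothetical helper: a missing parameter is reported as None.
    """
    params = parse_qs(urlparse(link).query)
    # parse_qs maps each name to a list of values; keep the first one.
    title = params.get("title", [None])[0]
    url = params.get("url", [None])[0]
    return title, url


links = [
    "http://sitename.com/pathname?title=moviename&url=VIDEO_URL",
    "http://sitename.com/pathname?title=another+movie",  # invented sample
]
rows = [title_and_url(link) for link in links]  # one (title, url) pair per link
titles = [row[0] for row in rows]  # the "title" column
urls = [row[1] for row in rows]    # the "url" column
```

The two lists can then be written out as separate columns with whatever tool holds the data.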

edited Mar 17 '16 at 18:32

answered Mar 17 '16 at 17:42


score 1 · Answer 2 · answered Mar 17 '16 at 17:47

urlparse can parse the URL; from there, take the query and parse that:

>>> import urlparse
>>> url = 'http://sitename.com/pathname?title=moviename&url=VIDEO_URL'
>>> urlparse.parse_qs(urlparse.urlparse(url).query)
{'title': ['moviename'], 'url': ['VIDEO_URL']}

Because a query string parameter can appear multiple times, the dictionary provides a list of the values found (even when only one value is found).
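
To illustrate the multi-value behaviour, here is a small sketch written for Python 3 (urllib.parse instead of urlparse). The repeated tag parameter is an invented example, not part of the original URL.

```python
from urllib.parse import parse_qs

query = "title=moviename&tag=comedy&tag=drama"  # hypothetical query string
params = parse_qs(query)

tags = params["tag"]        # every value supplied for the repeated parameter
title = params["title"][0]  # a single value still arrives wrapped in a list
```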

zmo · Answer 3 · 2016-03-17T17:47:58.077

You're doing it right; it's just that a standard URL is made up of:

<SCHEME>://<NETLOC>/<PATH>?<QUERY>

So to extract the details from the query, you can split the string yourself, if you like the dirty way:

>>> data = dict(item.split('=') for item in q.query.split('&'))
>>> data
{'url': 'VIDEO_URL', 'title': 'moviename'}
>>> print(data['url'])
VIDEO_URL

and there you have your URL! This is a very basic version of what the urlparse library offers through the parse_qsl() function. That function also converts + into spaces, accepts ';' as well as '&' as a separator, and unquotes percent-encoded values.

So to use urlparse's parse_qsl function, all you have to do is:

>>> import urlparse
>>> data = dict(urlparse.parse_qsl(q.query))  # parse_qsl returns a list of (name, value) pairs
>>> data
{'url': 'VIDEO_URL', 'title': 'moviename'}
>>> print(data['url'])
VIDEO_URL

N.B.: parse_qsl is NOT safer than the split() method, but it is more RELIABLE. The main difference is that parse_qsl works with all the forms of query string defined by the RFC, whereas the split() method works only with the simplest case: plain name=value pairs that need no decoding.
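
A rough illustration of that difference, assuming Python 3 (urllib.parse) and a made-up query whose values are encoded: the manual split leaves the encoding in place, while parse_qsl decodes it.

```python
from urllib.parse import parse_qsl

# Hypothetical query: the title contains a space ("+") and the url is percent-encoded.
query = "title=my+movie&url=http%3A%2F%2Fexample.com%2Fv%3Fid%3D1"

# Manual approach: split on "&", then on the first "=" only.
manual = dict(item.split("=", 1) for item in query.split("&"))

# Library approach: parse_qsl returns (name, value) pairs with decoding applied.
parsed = dict(parse_qsl(query))
```

Here manual["title"] still holds the raw text "my+movie", whereas parsed["title"] holds the decoded "my movie".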

score 0 · Answer 4 · edited May 23 '17 at 12:15

These answers are spot on for parsing the query string. To go a step further and also use dot notation, also see Convert Python dict to object?

from collections import namedtuple
QS = namedtuple('QS', qs.keys())
dotted_qs = QS(**qs)
dotted_qs.url  # ['VIDEO_URL']

Note that the dict that comes back from parse_qs can be multi-valued, hence the list type of dotted_qs.url. You can collapse it to single values with a dict comprehension or with parse_qsl:

from urlparse import parse_qs, parse_qsl

qs = {k: v[0] for k, v in parse_qs(q.query).items()}

Or...

qs = dict(parse_qsl(q.query))
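
Put together, the dot-notation idea might be sketched as follows for Python 3 (urllib.parse instead of urlparse). This is only an illustration: namedtuple requires field names to be valid Python identifiers, so it assumes the parameter names in the query qualify.

```python
from collections import namedtuple
from urllib.parse import urlparse, parse_qsl

q = urlparse("http://sitename.com/pathname?title=moviename&url=VIDEO_URL")

# dict() over parse_qsl keeps one value per name (the last one, if a name repeats).
qs = dict(parse_qsl(q.query))

# Field names come from the query; they must be valid Python identifiers.
QS = namedtuple("QS", list(qs))
dotted_qs = QS(**qs)

title = dotted_qs.title  # plain string rather than a one-element list
url = dotted_qs.url
```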

Hope that helps.

score -1 · Answer 5 · answered Mar 17 '16 at 17:42


To get just the query parameters split by the '&' you can use:

q.query.split('&')

Or to get pairs of parameter/value you can use:

args = [tuple(arg.split('=')) for arg in q.query.split('&')]

avip
