Slicing URL with Python

Question

I am working with a huge list of URLs. Just a quick question: I am trying to slice part of a URL out, see below:

http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3

How could I slice out:

http://www.domainname.com/page?CONTENT_ITEM_ID=1234

Sometimes there are more than two parameters after CONTENT_ITEM_ID, and the ID is different each time. I think it can be done by finding the first & and keeping only the characters before it, but I'm not quite sure how to do this.

Cheers

score 14 · Accepted Answer · edited Nov 19 '13 at 14:53

Use the urlparse module. Check this function:

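import urlparse  # Python 2; in Python 3 these functions live in urllib.parse

def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    parsed = urlparse.urlsplit(url)
    # keep only the query items that start with one of the wanted prefixes
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    # reassemble scheme, netloc and path with the filtered query and the fragment
    return urlparse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])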

In your example:

>>> a = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
>>> process_url(a)
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'

This function has the added bonus that it's easier to use if you decide that you also want some more query parameters, or if the order of the parameters is not fixed, as in:

>>> url='http://www.domainname.com/page?other_value=xx&param3&CONTENT_ITEM_ID=1234&param1'
>>> process_url(url, ('CONTENT_ITEM_ID', 'other_value'))
'http://www.domainname.com/page?other_value=xx&CONTENT_ITEM_ID=1234'
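For readers on Python 3, here is a rough port of the same idea. It is only a sketch: the name process_url_py3 is made up for this example, and it assumes the standard urllib.parse module, which is where urlsplit and urlunsplit live in Python 3.

```python
from urllib.parse import urlsplit, urlunsplit


def process_url_py3(url, keep_params=("CONTENT_ITEM_ID=",)):
    """Return url with only the query items that start with one of keep_params.

    process_url_py3 is a hypothetical name used for this sketch.
    """
    parsed = urlsplit(url)
    # str.startswith accepts a tuple of prefixes, so several can be kept at once
    filtered_query = "&".join(
        item
        for item in parsed.query.split("&")
        if item.startswith(tuple(keep_params))
    )
    # SplitResult is a named tuple, so _replace swaps in the new query string
    return urlunsplit(parsed._replace(query=filtered_query))


url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
print(process_url_py3(url))
```

As in the Python 2 version, the items that are kept stay in their original order, and the scheme, host, path and fragment are passed through unchanged.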

score 4 · Answer 2 · answered Nov 03 '08 at 14:34


The quick and dirty solution is this:

>>> "http://something.com/page?CONTENT_ITEM_ID=1234&param3".split("&")[0]
'http://something.com/page?CONTENT_ITEM_ID=1234'

answered Nov 03 '08 at 14:34

Rafał Dowgird


score 3 · Answer 3 · answered Nov 03 '08 at 14:36

Another option would be to use the split function, with "&" as the separator. That way you get the base URL (including the first parameter) and each of the remaining parameters as separate items.

   url.split("&")

returns a list with

  ['http://www.domainname.com/page?CONTENT_ITEM_ID=1234', 'param2', 'param3']

score 1 · Answer 4 · answered Nov 03 '08 at 14:33


I figured it out. Below is what I needed to do:

url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
url = url[: url.find("&")]
print url
http://www.domainname.com/page?CONTENT_ITEM_ID=1234

answered Nov 03 '08 at 14:33

RailsSon


Careful with this - if there are no parameters (no "&"), it will just drop the last character from the url. – Rafał Dowgird Nov 03 '08 at 14:38
See http://stackoverflow.com/questions/229352/python-find-question for a better solution. – S.Lott Nov 03 '08 at 14:42
Ah I see how that could be a problem and thanks for the warning. The list I am using always has a parameter after it but I will keep that in mind for the future. :) – RailsSon Nov 03 '08 at 14:45
Be careful with URL parsing; most of the time it is not as easy as it seems. You'd better use the urlparse module, even if it looks like it's easy. – Bite code Nov 03 '08 at 15:37
@Eef: Always means "mostly". Never means "Rarely". As soon as you say "Always", you know it will break because 2 of 14,000 violate your "always" rule. – S.Lott Nov 03 '08 at 15:45
@S.Lott: couldn't agree more… – tzot Nov 03 '08 at 19:55
Cheers for the great advice!! I'm taking all this on board :) – RailsSon Nov 03 '08 at 23:31
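To make the warning in these comments concrete: str.find returns -1 when there is no "&", so the slice url[:-1] silently drops the last character. A small sketch of one way around that uses str.partition, which returns the whole string when the separator is missing. The function name strip_after_first_amp is made up for this example.

```python
def strip_after_first_amp(url):
    """Return everything before the first '&', or the whole URL if there is none."""
    # partition always returns a 3-tuple: (head, separator, tail)
    head, _sep, _tail = url.partition("&")
    return head


with_amp = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
no_amp = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234"

# The find-based slice loses the trailing '4' when there is no '&',
# because find returns -1 and the slice becomes no_amp[:-1].
print(no_amp[: no_amp.find("&")])

# The partition-based version leaves such a URL untouched.
print(strip_after_first_amp(no_amp))
print(strip_after_first_amp(with_amp))
```

This still treats the URL as a plain string, so the other comments about using a proper URL parser apply here too.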

score 1 · Answer 5 · answered Nov 03 '08 at 15:52

Parsing URLs is never as simple as it seems; that's why the urlparse and urllib modules exist.

E.g.:

import urllib
url ="http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
query = urllib.splitquery(url)
result = "?".join((query[0], query[1].split("&")[0]))
print result
http://www.domainname.com/page?CONTENT_ITEM_ID=1234

This is still not 100% reliable, but it is much more robust than splitting the string yourself, because there are many valid URL formats that you and I don't know about and will only discover one day in the error logs.
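The snippet above is Python 2 code. As far as I know, splitquery is not available under that name on Python 3's top-level urllib package, so here is a rough Python 3 sketch of the same "keep the first query item" idea built on urllib.parse.urlsplit instead. The function name keep_first_query_item is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def keep_first_query_item(url):
    """Return url with everything after the first '&' in the query removed."""
    parts = urlsplit(url)
    # maxsplit=1: only the first '&' matters, the rest is discarded
    first_item = parts.query.split("&", 1)[0]
    return urlunsplit(parts._replace(query=first_item))


url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
print(keep_first_query_item(url))
```

Because only the query component is touched, a URL without a query string comes back unchanged, and a fragment such as #top is kept rather than being cut off along with the extra parameters.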

score 0 · Answer 6 · answered Jul 20 '12 at 09:39


Besides urlparse there is also furl, which IMHO has a better API.

answered Jul 20 '12 at 09:39

neutrinus


score 0 · Answer 7 · answered Feb 24 '10 at 14:43


An ancient question, but I'd still like to point out that query string parameters can also be separated by ';', not only by '&'.
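If both separators have to be handled, one possible sketch is to split the query string on either character with a regular expression. The helper name split_query_items and the sample query are made up for this example; whether ';' really is a separator depends on the site that produced the URLs.

```python
import re


def split_query_items(query):
    """Split a query string on '&' or ';', dropping empty items."""
    return [item for item in re.split(r"[&;]", query) if item]


print(split_query_items("CONTENT_ITEM_ID=1234;param2&param3"))
```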

answered Feb 24 '10 at 14:43

Alien Life Form


score 0 · Answer 8 · answered Nov 03 '08 at 14:34


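import re
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
# non-greedy match of everything before the first '&'
m = re.search('(.*?)&', url)
# m is None when the URL contains no '&', so fall back to the whole URL
print m.group(1) if m else url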

answered Nov 03 '08 at 14:34

Corey Goldberg


score 0 · Answer 9 · edited May 23 '17 at 12:31


Look at the urllib2 file name question for some discussion of this topic.

Also see the "Python Find Question" question.

edited May 23 '17 at 12:31

Community


answered Nov 03 '08 at 14:41

S.Lott


score 0 · Answer 10 · answered Nov 03 '08 at 15:31

This method isn't dependent on the position of the parameter within the url string. This could be refined, I'm sure, but it gets the point across.

url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
base, query = url.split('?', 1)
# partition() copes with bare items such as 'param2' that have no '='
params = dict(i.partition('=')[::2] for i in query.split('&'))
new_url = base + '?CONTENT_ITEM_ID=' + params['CONTENT_ITEM_ID']
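The same position-independent lookup can also be sketched with the standard library's query parser instead of manual splitting. This is a Python 3 sketch; the function name keep_only and its name parameter are made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_only(url, name="CONTENT_ITEM_ID"):
    """Return url with every query parameter except `name` removed."""
    parts = urlsplit(url)
    # keep_blank_values=True so bare items like 'param2' are parsed, not skipped
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key == name]
    # urlencode rebuilds 'key=value' pairs and percent-encodes them again
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = "http://www.domainname.com/page?param2&CONTENT_ITEM_ID=1234&param3"
print(keep_only(url))
```

One thing to watch: because the value is decoded and then encoded again, the escaping in the result may differ slightly from the original URL (for example, a space written as %20 comes back as +).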
