
I'm parsing a URL in Python. Below are a sample URL and my code. I want to split the part number (74743) out of the URL and write a for loop that takes each value from a list of parts instead. I tried urlparse but couldn't finish it, mostly because parts of the URL keep changing. I just want the easiest and fastest way to do this.

Sample url:

http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is=

(http://example.com/wps/portal) Always fixed

(lYuxDoIwGAYf6f9aqKSjMNQ) Always changing

(74743) Will be taken from a list named Parts

(IntNumberOf=&is=) Also changing depending on the section of the website

Here's the Code:

from lxml import html
import requests
import urlparse


Parts = [74743, 85731, 93021]

url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='

parsing = urlparse.urlsplit(url)

print parsing
T.M

1 Answer

>>> import urlparse

>>> url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='

>>> split_url = urlparse.urlsplit(url)
>>> split_url.path
'/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/'

You can split the path into a list of strings using '/', slice the list, and re-join:

>>> path = split_url.path
>>> path.split('/')
['', 'wps', 'portal', 'lYuxDoIwGAYf6f9aqKSjMNQ', '']

Slice off the last two:

>>> path.split('/')[:-2]
['', 'wps', 'portal']

And re-join:

>>> '/'.join(path.split('/')[:-2])
'/wps/portal'
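
The same slice works however many segments come before the last one, because it only drops the final segment and the empty string left by the trailing slash. A minimal sketch of that idea; `replace_last_segment` is a hypothetical helper name, not a library function, and it assumes the path always ends with '/':

```python
def replace_last_segment(path, new_segment):
    """Swap the final segment of a path that ends with a trailing slash."""
    parts = path.split('/')   # '/a/b/c/' -> ['', 'a', 'b', 'c', '']
    kept = parts[:-2]         # drop the last segment and the trailing ''
    return '/'.join(kept + [new_segment, ''])


short_path = '/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/'
deep_path = '/wps/portal/ut/p/c1/lYuxDoIwGAYf6f9aqKSjMNQ/'

print(replace_last_segment(short_path, 'ASDFZXCVQWER'))  # /wps/portal/ASDFZXCVQWER/
print(replace_last_segment(deep_path, 'ASDFZXCVQWER'))   # /wps/portal/ut/p/c1/ASDFZXCVQWER/
```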

To parse the query, use parse_qs:

>>> urlparse.parse_qs(split_url.query)
{'PartNo': ['74743']}

To keep the empty parameters use keep_blank_values=True:

>>> query = urlparse.parse_qs(split_url.query, keep_blank_values=True)
>>> query
{'PartNo': ['74743'], 'is': [''], 'IntNumberOf': ['']}
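
If the order of the parameters matters, `parse_qsl` returns a list of (name, value) pairs in their original order instead of a dictionary. A short sketch, assuming Python 3, where these functions live in `urllib.parse` rather than `urlparse`:

```python
from urllib.parse import urlsplit, parse_qs, parse_qsl

url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
query_string = urlsplit(url).query

# Default behaviour: parameters with blank values are dropped.
without_blanks = parse_qs(query_string)

# keep_blank_values=True keeps 'IntNumberOf' and 'is' with empty strings.
with_blanks = parse_qs(query_string, keep_blank_values=True)

# parse_qsl returns (name, value) pairs in the order they appear in the URL.
pairs = parse_qsl(query_string, keep_blank_values=True)

print(without_blanks)
print(with_blanks)
print(pairs)
```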

You can then modify the query dictionary:

>>> query['PartNo'] = 85731

And update the original split_url:

>>> import urllib
>>> updated = split_url._replace(path='/'.join(path.split('/')[:-2] +
...                                            ['ASDFZXCVQWER', '']),
...                              query=urllib.urlencode(query, doseq=True))

>>> urlparse.urlunsplit(updated)
'http://example.com/wps/portal/ASDFZXCVQWER/?PartNo=85731&is=&IntNumberOf='
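
For the loop over `Parts` that the question asks about, the steps can be combined roughly as follows. This is only a sketch, assuming Python 3 (where `urllib.parse` replaces both `urlparse` and `urllib.urlencode`) and assuming the changing path segment is already known from a sample URL; `build_part_url` is a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

Parts = [74743, 85731, 93021]
sample_url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='


def build_part_url(url, part_number):
    """Return url with its PartNo query parameter set to part_number."""
    split_url = urlsplit(url)
    query = parse_qs(split_url.query, keep_blank_values=True)
    query['PartNo'] = [str(part_number)]
    # doseq=True encodes each list element rather than the list's repr.
    new_query = urlencode(query, doseq=True)
    return urlunsplit(split_url._replace(query=new_query))


part_urls = []
for part in Parts:
    part_urls.append(build_part_url(sample_url, part))

for part_url in part_urls:
    print(part_url)
```

On Python 3.7 and later the dictionary keeps insertion order, so the parameters come back out in the order they appeared in the sample URL.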
Peter Wood
  • For the base_path, what if I have more than two '/', like /wps/portal/ut/p/c1/lYuxDoIwGAYf6f9aqKSjMNQ/? How can I deal with that? – T.M Oct 18 '15 at 23:09
  • @T.M What url? Have you tried the code? If you have another question, ask a new question. Read [how to ask](http://www.stackoverflow.com/help/how-to-ask) first, particularly the section on how to create a [Minimal, Complete, Verifiable Example](http://stackoverflow.com/help/mcve). – Peter Wood Oct 19 '15 at 20:20
  • Sorry, my computer got jammed. Thanks, I appreciate it. But with this URL: url = 'http://www.example.com/wps/portal/!ut/p/c1/04_SB8K8xLLM9MSSzPy8xBz9CP0os3g_A-ewIE8TIwN3Q0tDA0_v4EDLUCNHIwMvc6B8JJK8QbCpgYGniU9YiLOPu7GBgQFJut0DwkxBuoONggO8jA08jQjo9vPIz03Vj9SPMsepyslUP0Q_0hWoKBKvooLc0IhyQ91AAHb2Eas!/dl2/d1/L0lDUmlTUSEhL3dHa0FKRnNBL1lCUlp3QSEhL2Vu/?PartNo=85731&IntNumberOf=&is=' base_path gives me nothing, and `updated` gives me an "Invalid syntax" error. – T.M Oct 19 '15 at 20:44
  • Apologies, I was using os.path.basename without thinking. I've replaced with an example using str.split. – Peter Wood Oct 19 '15 at 21:00
  • Thanks, the first part works very well, but `updated` throws a traceback: Traceback (most recent call last): File "solving_url_issue2.py", line 41, in updated = split_url._update(path='/'.join(base_path.split('/')[:-2] + AttributeError: 'SplitResult' object has no attribute '_update'. I tried to find a solution for it but didn't find any. – T.M Oct 20 '15 at 21:37
  • Apologies, it should have been `_replace`, not `_update`. I can never remember that, and hadn't checked. Sorry. The object is a [`namedtuple`](https://docs.python.org/2/library/collections.html#collections.namedtuple). – Peter Wood Oct 20 '15 at 21:42
  • The whole code is working perfectly, but the last part joined the URL like this: http://example.com/wps/portal/ASDFZXCVQWER/?PartNo=85731&is=%5B%27%27%5D&IntNumberOf=%5B%27%27%5D which is different from what was requested above. – T.M Oct 21 '15 at 19:16
  • @T.M Sorry, you need to add [**`doseq=True`**](https://docs.python.org/2/library/urllib.html#urllib.urlencode) to **`urlencode`**. I've updated the answer. – Peter Wood Oct 21 '15 at 19:41
  • :) Now the last part is "&is=&IntNumberOf=" rather than "&IntNumberOf=&is=" as above. New URL: http://example.com/wps/portal/ASDFZXCVQWER/?PartNo=85731&is=&IntNumberOf= – T.M Oct 21 '15 at 20:28
  • @T.M Does that matter? – Peter Wood Oct 21 '15 at 21:18
  • Yes, it gave me an error when I tried it. Also, after fixing the code, can I make "ASDFZXCVQWER" a variable that can be anything pulled after scraping the domain to get the exact web page? – T.M Oct 21 '15 at 21:46
  • Maybe [see this question](http://stackoverflow.com/questions/25107663/keeping-url-parameters-in-order-when-encoding-with-urllib) about using `OrderedDict` instead of a `dict` for the query parameters – Peter Wood Oct 21 '15 at 22:33
  • By the error, I meant a 404 page error (page does not exist) caused by the last part of the URL: the original URL has "&IntNumberOf=&is=" but the code above gave me "&is=&IntNumberOf=", with the two switched. – T.M Oct 23 '15 at 19:34
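
The reordering discussed in these comments comes from passing the query through an unordered dict. One way around it, along the lines of the `OrderedDict` suggestion, is sketched below. It assumes Python 3 (`urllib.parse`; on Python 2 the same functions are in `urlparse` and `urllib`) and that no parameter name is repeated, since a mapping keeps only one value per name. `set_part_number` is a hypothetical helper name:

```python
from collections import OrderedDict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def set_part_number(url, part_number):
    """Change PartNo while keeping every query parameter in its original position."""
    split_url = urlsplit(url)
    # parse_qsl yields (name, value) pairs in URL order; OrderedDict keeps that order.
    query = OrderedDict(parse_qsl(split_url.query, keep_blank_values=True))
    query['PartNo'] = str(part_number)
    return urlunsplit(split_url._replace(query=urlencode(query)))


original = 'http://example.com/wps/portal/ASDFZXCVQWER/?PartNo=74743&IntNumberOf=&is='
reordered_safe = set_part_number(original, 85731)
print(reordered_safe)
```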