Is it possible in Python to capture individual parts of a URL with constant structure?

Question

I apologize if my question is vague; it's difficult to put into words. Suppose, for example, I need parts of this URL:
https://stackoverflow.com/questions/449775/how-can-i-split-a-url-string-up-into-separate-parts-in-python

I need the question number and the question title. Assume the title may be followed by other changing segments, still separated by "/". The base URL and the word "questions" never change. The data I want changes, but it is unique to each question and is always in the same place in the URL.

Is there a way to parse this URL in Python and extract the parts I need?

score 0 · Answer 1 · answered Oct 19 '22 at 04:22

Let's take the link to this question, which is

https://stackoverflow.com/questions/74119810/is-it-possible-in-python-to-capture-individual-parts-of-a-url-with-constant-stru

Look at the pattern: ignoring the leading https://, the parts of the URL are separated by "/", so we can split on that character.

In [1]: link = "https://stackoverflow.com/questions/74119810/is-it-possible-in-python-to-capture-individual-parts-of-a-url-with-constant-stru"

Let's remove the "https://" prefix (the first 8 characters) first:

In [3]: link[8:]                                                                
Out[3]: 'stackoverflow.com/questions/74119810/is-it-possible-in-python-to-capture-individual-parts-of-a-url-with-constant-stru'

Now split it

In [4]: link[8:].split('/')                                                     
Out[4]: 
['stackoverflow.com',
 'questions',
 '74119810',
 'is-it-possible-in-python-to-capture-individual-parts-of-a-url-with-constant-stru']

The question ID is at index 2, so:

In [5]: link[8:].split('/')[2]                                                  
Out[5]: '74119810'

Let's wrap it into a function:

In [6]: def get_qid(link: str) -> str:
   ...:     # Drop the 8-character "https://" prefix, then take the third segment
   ...:     return link[8:].split('/')[2]

And test it on a separate link.

In [7]: get_qid("https://stackoverflow.com/questions/74119795/how-to-create-session-in-graphql-in-fastapi-to-store-token-safely-after-generati")
Out[7]: '74119795'
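
One weakness of get_qid is the hard-coded link[8:], which only works for "https://" links. As a rough sketch (not part of the original answer), the standard library's urllib.parse.urlsplit can separate the scheme and host from the path, so the prefix length no longer matters. The function name get_question_parts and the check for the "questions" segment are my own assumptions for illustration:

```python
from urllib.parse import urlsplit


def get_question_parts(link: str) -> tuple[str, str]:
    """Return (question_id, title_slug) from a question URL.

    Hypothetical helper: assumes the path looks like
    /questions/<id>/<title-slug>[/...].
    """
    # urlsplit handles the scheme and host, so no fixed prefix length is needed
    path = urlsplit(link).path  # e.g. '/questions/74119810/is-it-possible-...'
    segments = path.strip("/").split("/")
    # Fail loudly if the URL does not have the expected shape
    if len(segments) < 3 or segments[0] != "questions":
        raise ValueError(f"not a question URL: {link!r}")
    return segments[1], segments[2]
```

Called on the same link as above, this should give the same ID as get_qid, and it keeps working if the link starts with "http://" instead.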

As for the question title, you would need web scraping or an API to get it in full. You can extract it from the link, but it won't be complete, because the link drops part of the title.

As you can see in this example:

In [10]: ' '.join(link[8:].split('/')[-1].split('-'))                           
Out[10]: 'is it possible in python to capture individual parts of a url with constant stru'

The last element of the split link is the title. We split it on '-', which stands in for a space, and join the pieces with ' '.join. The returned title is incomplete because the link does not contain the full title.

score 0 · Accepted Answer · answered Oct 19 '22 at 04:28

The code below picks apart the URL using str.split() with '/' as the delimiter, then assigns the portions of interest to variables.

It's not particularly robust, but given your specification that the URL always has the same format, this is an efficient way to do what you asked:

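URL = "https://stackoverflow.com/questions/449775/how-can-i-split-a-url-string-up-into-separate-parts-in-python"

# Splitting on "/" yields "https:", "" (between the two slashes), the server,
# then the path segments; *_ absorbs anything that follows the title.
protocol, _, server, question, question_number, question_title, *_ = URL.split("/")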
print("protocol:       ", protocol)
print("server:         ", server)
print("question number:", question_number)
print("question title: ", question_title)

Results:

protocol:        https:
server:          stackoverflow.com
question number: 449775
question title:  how-can-i-split-a-url-string-up-into-separate-parts-in-python
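
If a URL ever arrives that does not match the expected layout, the unpacking above raises a ValueError (too few segments) or silently assigns the wrong pieces. As a minimal sketch of a stricter alternative, assuming the host is always stackoverflow.com and the question number is all digits, a regular expression with named groups can check the shape and extract both parts in one step. The names QUESTION_URL and parse_question_url are hypothetical, chosen only for this example:

```python
import re

# Assumed layout: http(s)://stackoverflow.com/questions/<digits>/<title-slug>[/...]
QUESTION_URL = re.compile(
    r"^https?://stackoverflow\.com/questions/(?P<number>\d+)/(?P<title>[^/?#]+)"
)


def parse_question_url(url: str):
    """Return (question_number, question_title), or None if the URL does not match."""
    match = QUESTION_URL.match(url)
    if match is None:
        return None
    return match.group("number"), match.group("title")
```

Anything after the title segment is ignored, much like the *_ in the unpacking version, and a URL with a different shape returns None instead of raising.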

Thank you for your help! Your solution is better and easier than the one I eventually came up with myself. — aaander, Oct 19 '22 at 10:48
