13

I have the following URLs:

https://stackoverflow.com/questions/7990301?aaa=aaa
https://stackoverflow.com/questions/7990300?fr=aladdin
https://stackoverflow.com/questions/22375#6
https://stackoverflow.com/questions/22375?
https://stackoverflow.com/questions/22375#3_1

I need them without the query string and fragment, for example:

https://stackoverflow.com/questions/7990301
https://stackoverflow.com/questions/7990300
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375

My attempt:

url='https://stackoverflow.com/questions/7990301?aaa=aaa'
if '?' in url:
    url=url.split('?')[0]
if '#' in url:
    url = url.split('#')[0]

I think this is a clumsy way to do it.

Boris Verkhovskiy
xin.chen

6 Answers

15

The very helpful library furl makes it trivial to remove both query and fragment parts:

>>> import furl
>>> furl.furl("https://hi.com/?abc=def#ghi").remove(args=True, fragment=True).url
'https://hi.com/'
Matthew Story
    Why download this library when the builtin Python way is basically exactly the same: `from urllib.parse import urlsplit, urlunsplit` then `urlunsplit(urlsplit("https://hi.com/?abc=def#ghi")._replace(query="", fragment=""))` – Boris Verkhovskiy May 08 '21 at 04:20
7

Splitting on a separator that doesn't exist in the string just returns a one-element list, so depending on your goal, you could simplify your existing code like this:

url = url.split('?')[0].split('#')[0]

Not saying this is the best way (furl is a great solution), but it is a way.
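To make the point about missing separators concrete, here is a minimal sketch. The helper name strip_query_and_fragment is made up for illustration, and the maxsplit argument of 1 is an optional addition that stops splitting after the first match:

```python
def strip_query_and_fragment(url: str) -> str:
    """Drop everything from the first '?' or '#' onward (hypothetical helper)."""
    # str.split returns [url] when the separator is absent,
    # so indexing with [0] is safe even for URLs without '?' or '#'.
    return url.split("?", 1)[0].split("#", 1)[0]


# No '?' in the string: the result is a list with one element.
print("https://stackoverflow.com/questions/22375".split("?"))
print(strip_query_and_fragment("https://stackoverflow.com/questions/22375#3_1"))
```

This treats the URL as a plain string, so it does no validation; it only cuts at the first '?' or '#'.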

TheDavidFactor
4

In your example you're also removing the fragment (the thing after a #), not just the query.

You can remove both by using urllib.parse.urlsplit, then calling ._replace on the namedtuple it returns, and converting back to a string URL with urllib.parse.urlunsplit:

from urllib.parse import urlsplit, urlunsplit

def remove_query_params_and_fragment(url):
    return urlunsplit(urlsplit(url)._replace(query="", fragment=""))

Output:

>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990301?aaa=aaa")
'https://stackoverflow.com/questions/7990301'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990300?fr=aladdin")
'https://stackoverflow.com/questions/7990300'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#6")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375?")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#3_1")
'https://stackoverflow.com/questions/22375'
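To see why ._replace works here, it may help to look at what urlsplit returns. The sketch below uses a combined example URL that is not in the question (it has both a query and a fragment) and prints the individual fields before rebuilding the URL:

```python
from urllib.parse import urlsplit, urlunsplit

# Illustrative URL with both a query string and a fragment.
parts = urlsplit("https://stackoverflow.com/questions/22375?aaa=aaa#3_1")

# parts is a namedtuple with fields: scheme, netloc, path, query, fragment.
print(parts.scheme, parts.netloc, parts.path)
print(parts.query)
print(parts.fragment)

# _replace returns a new tuple with the chosen fields blanked out;
# urlunsplit then omits the '?' and '#' for empty components.
cleaned = urlunsplit(parts._replace(query="", fragment=""))
print(cleaned)
```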
Community
Boris Verkhovskiy
2

You could try

urls = ["https://stackoverflow.com/questions/7990301?aaa=aaa",
        "https://stackoverflow.com/questions/7990300?fr=aladdin",
        "https://stackoverflow.com/questions/22375#6",
        "https://stackoverflow.com/questions/22375?",
        "https://stackoverflow.com/questions/22375#3_1"]

urls_without_query = [url.split('?')[0] for url in urls]

For example, "https://stackoverflow.com/questions/7990301?aaa=aaa".split('?') returns the list ["https://stackoverflow.com/questions/7990301", "aaa=aaa"], so if that string is url, url.split('?')[0] gives you "https://stackoverflow.com/questions/7990301".

Edit: I didn't think about # fragments. The other answers might help you more :)
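One way to cover fragments with the same splitting idea is to split on either character at once. This is only a sketch of an alternative, not what the answer above does; it swaps str.split for re.split, and the function name is made up for illustration:

```python
import re


def strip_after_query_or_fragment(url: str) -> str:
    """Cut the URL at the first '?' or '#', whichever comes first (hypothetical helper)."""
    # maxsplit=1 stops after the first match; [0] is the part before it.
    # If neither character is present, re.split returns [url] unchanged.
    return re.split(r"[?#]", url, maxsplit=1)[0]


urls = ["https://stackoverflow.com/questions/7990301?aaa=aaa",
        "https://stackoverflow.com/questions/22375#6",
        "https://stackoverflow.com/questions/22375?"]
print([strip_after_query_or_fragment(u) for u in urls])
```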

Jay Calamari
1

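You can use w3lib. Called with no parameter list, url_query_cleaner keeps none of the query parameters, and by default it drops the fragment as well: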

from w3lib import url as w3_url
url_without_query = w3_url.url_query_cleaner(url)
Lücks
0

Here is an answer that uses only the standard library and parses the URL properly:

from urllib.parse import urlparse

url = 'http://www.example.com/this/category?one=two'
parsed = urlparse(url)
print("".join([parsed.scheme, "://", parsed.netloc, parsed.path]))

Expected output:

http://www.example.com/this/category

Note: this also strips params and the fragment, but is easy to modify to include those if you want.
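As a sketch of the modification the note mentions, one option is to blank out only the unwanted fields with ._replace and rebuild with urlunparse, instead of joining the parts by hand. The example URL below is an assumption for illustration; it adds a ;v=1 path parameter and a #top fragment so the difference is visible:

```python
from urllib.parse import urlparse, urlunparse

# Illustrative URL with params (;v=1), a query and a fragment.
url = "http://www.example.com/this/category;v=1?one=two#top"
parsed = urlparse(url)

# Remove only the query; params and fragment are kept.
only_query_removed = urlunparse(parsed._replace(query=""))

# Remove params, query and fragment, matching the join-based version above.
all_removed = urlunparse(parsed._replace(params="", query="", fragment=""))

print(only_query_removed)
print(all_removed)
```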

Tom Anthony