How can I remove the extra characters after a target substring in a URL?

Question

I'm scraping a website, but I'm stuck on an error that I can't handle.

I have the following URL, which doesn't work as is: I need to remove all characters after jpg. I tried a regular expression but couldn't get it to work, and I don't want to count the extra characters and slice the URL by index, because that won't work for every image.

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg

This is the code I've tried so far:

regex = re.sub('[^*\w]/+\w', '', image)

# and

sep = 'jpg/'
rest = image.split(sep, 1)[0]
print(rest)

Neither attempt worked. I also checked this question, but I couldn't find a solution there because my URL contains unusual characters and repeated parts.

The expected result looks like this:

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg

score 1 · Answer 1 · answered Jul 16 '20 at 22:22

You can try splitting the URL on /, removing the last part, and joining the rest back with /.

Something like:

# Input
i_string = "https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
print(i_string)

# Approach 1
res = "/".join(i_string.split("/")[:-1])
print(res)


# Approach 2, using function
def remove_last_part(i_string: str) -> str:
    return "/".join(i_string.split("/")[:-1])


print(remove_last_part(i_string))

Result

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg
https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg
https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg
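
A slightly shorter variant of the same idea, as a sketch: str.rsplit with maxsplit=1 cuts only at the last /, so there is nothing to join back. The name drop_last_segment is made up for illustration, and the function assumes the URL always carries the extra trailing segment; a URL that is already clean would lose its file name.

```python
def drop_last_segment(url: str) -> str:
    # Split once, starting from the right, and keep everything before the last "/".
    # Assumption: the URL always ends with an unwanted extra segment.
    return url.rsplit("/", 1)[0]


url = (
    "https://upload.wikimedia.org/wikipedia/commons/2/25/"
    "Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
)
print(drop_last_segment(url))
```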

Since you want to remove all the characters after jpg, you can do something like the following:

# Input
i_string = "https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
print(i_string)

def remove_last_part(i_string: str, delimiter: str) -> str:
    # Keep the text before the first delimiter, then put the delimiter back.
    return i_string.split(delimiter, 1)[0] + delimiter


print(remove_last_part(i_string, "jpg"))

Result:

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg
https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg
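
One caveat with appending the delimiter unconditionally is that a URL without jpg in it would still get jpg tacked on. A possible way around that, sketched with str.partition, is shown here. The helper name truncate_after and the default marker ".jpg" are assumptions for illustration, not part of the original answer.

```python
def truncate_after(url: str, marker: str = ".jpg") -> str:
    # partition() splits at the first occurrence of marker and returns
    # (text before, marker, text after). When marker is absent it returns
    # (url, "", ""), so the URL comes back unchanged.
    head, found, _ = url.partition(marker)
    return head + found


url = (
    "https://upload.wikimedia.org/wikipedia/commons/2/25/"
    "Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
)
print(truncate_after(url))
print(truncate_after("https://example.com/picture.png"))  # no ".jpg": left as is
```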

score 1 · Accepted Answer · answered Jul 16 '20 at 22:41

This will stop the match at the first occurrence of .jpg in the given URL.

import re

url = "https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
regex = r"^https?://[^\s/$.?#].[^\s]*?(?:\.jpg)"
result = re.match(regex, url)
print(result.group())

Output:

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg

https://regex101.com/r/i6mu3z/1
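
If the scraped images are not all .jpg, the same non-greedy idea can probably be widened to a small set of extensions. The following is only a sketch: the extension list, the pattern name IMAGE_URL, and the helper strip_after_extension are assumptions to adapt, not something taken from the question. The lookahead requires the extension to be followed by / or the end of the string, so a path segment such as .jpgfoo is not mistaken for a match.

```python
import re

# Hypothetical extension list; adjust it to the image types you actually scrape.
IMAGE_URL = re.compile(r"^.*?\.(?:jpe?g|png|gif|svg)(?=/|$)", re.IGNORECASE)


def strip_after_extension(url: str) -> str:
    # Non-greedy .*? stops at the first image extension that ends a path segment.
    match = IMAGE_URL.match(url)
    # Return the URL unchanged when no known extension is found.
    return match.group() if match else url


url = (
    "https://upload.wikimedia.org/wikipedia/commons/2/25/"
    "Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
)
print(strip_after_extension(url))
```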
