I'm scraping the web, but I'm stuck on an error that I haven't been able to handle.

I have the following URL, but it doesn't work as-is: I need to remove all the characters after jpg. I tried a regular expression but couldn't get it to work, and I don't want to count the extra characters or slice the URL by index, because that won't work for every image.

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg

This is the code I've tried so far:

regex = re.sub('[^*\w]/+\w', '', image)

# and

sep = 'jpg/'
rest = image.split(sep, 1)[0]
print(rest)

Neither attempt worked. I also checked this question, but I couldn't find a solution there because my URL contains weird and similar-looking characters.

The expected result looks like this:

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg
Umutambyi Gad

2 Answers

You can split the URL on /, drop the last part, and join the remaining parts back together with /.

Something like:

# Input
i_string = "https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
print(i_string)

# Approach 1
res = "/".join(i_string.split("/")[:-1])
print(res)


# Approach 2, using function
def remove_last_part(i_string: str) -> str:
    return "/".join(i_string.split("/")[:-1])


print(remove_last_part(i_string))

Result

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg
https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg
https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg
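
As a variation on the same idea, splitting once from the right avoids breaking the whole URL apart and rejoining it. This is only a minimal sketch: the helper name drop_last_segment is hypothetical, and it assumes the unwanted thumbnail part is always the last path segment.

```python
def drop_last_segment(url: str) -> str:
    """Return the URL without its last "/"-separated segment (hypothetical helper)."""
    # rsplit with maxsplit=1 splits only once, starting from the right,
    # so element 0 is everything before the final "/".
    return url.rsplit("/", 1)[0]


url = (
    "https://upload.wikimedia.org/wikipedia/commons/2/25/"
    "Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
)
print(drop_last_segment(url))
```

Note that a URL that already ends at the image name would lose its file name with this approach, so it only fits when every URL carries the extra trailing segment.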

Since you want to remove all the characters after jpg, you can do something like the following:

# Input
i_string = "https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
print(i_string)

def remove_last_part(i_string: str, delimiter: str) -> str:
    # Keep everything before the first delimiter, then add the delimiter back
    return i_string.split(delimiter, 1)[0] + delimiter


print(remove_last_part(i_string, "jpg"))

Result:

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg
https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg
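
One caveat with the function above is that it appends the delimiter even when the URL does not contain it. A possible alternative, sketched here with a hypothetical helper named truncate_after and an assumed default marker of ".jpg", uses str.partition, which returns an empty separator when the marker is missing:

```python
def truncate_after(url: str, marker: str = ".jpg") -> str:
    """Cut the URL right after the first occurrence of marker (hypothetical helper)."""
    # partition returns (head, marker, tail); when the marker is absent it
    # returns (url, "", ""), so head + sep is simply the unchanged URL.
    head, sep, _tail = url.partition(marker)
    return head + sep


url = (
    "https://upload.wikimedia.org/wikipedia/commons/2/25/"
    "Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
)
print(truncate_after(url))
print(truncate_after("https://example.com/picture.png"))  # no ".jpg", left unchanged
```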
Ambrish

This regex uses a lazy quantifier (*?), so the match stops at the first occurrence of .jpg in the given URL.

import re

url = "https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
regex = r"^https?://[^\s/$.?#].[^\s]*?(?:\.jpg)"
result = re.match(regex, url)
print(result.group())

Output:

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg
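
If the scraped images are not all .jpg files, the same lazy-match idea can be extended to a few extensions. The sketch below is an assumption-laden illustration rather than part of the answer above: the extension list, the case-insensitive flag, the pattern name IMAGE_URL, and the helper strip_after_extension are all hypothetical choices. It also falls back to the original URL when nothing matches, instead of failing on a None result.

```python
import re

# Lazily match up to the first image extension that ends a path segment
# (followed by "/", "?", "#", or the end of the string).
IMAGE_URL = re.compile(r".*?\.(?:jpe?g|png|gif)(?=[/?#]|$)", re.IGNORECASE)


def strip_after_extension(url: str) -> str:
    """Cut the URL after the first image extension (hypothetical helper)."""
    match = IMAGE_URL.match(url)
    # Return the URL unchanged when no known extension is found
    return match.group() if match else url


url = (
    "https://upload.wikimedia.org/wikipedia/commons/2/25/"
    "Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
)
print(strip_after_extension(url))
```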

https://regex101.com/r/i6mu3z/1

Vishal Singh