Delete all characters that come after a given string

Question

How exactly can I delete the characters after .jpg? Is there a way in Python to separate the extension from whatever follows it? For example, I have a link like this:

https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC

How can I delete everything after .jpg? I tried replacing, but it didn't work. Is there another way, such as counting the characters in the string or something like that? This is how I get the .jpg links:

for link in links:
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'html.parser')
    img_links = []
    for img in soup.select('a.thumbnail img[src]'):
        print(img["src"])
        with open('links'+'.csv', 'a', encoding = 'utf-8', newline='') as csv_file:
            file_is_empty = os.stat(self.filename+'.csv').st_size == 0
            fieldname = ['links']
            writer = csv.DictWriter(csv_file, fieldnames = fieldname)
            if file_is_empty:
                writer.writeheader()
            writer.writerow({'links':img["src"]})

        img_links.append(img["src"])

What part of your code is relevant to the question? You ask how to remove a part of a string and post 10 lines of code which do all kinds of things and don't even work when copied into an editor (imports and `links` are missing and also irrelevant). — Niklas Mertsch, Mar 12 '21 at 17:08

score 1 · Answer 1 · answered Mar 12 '21 at 17:08

You could use split. This assumes the string contains '.jpg'; if it doesn't, split returns the whole string as a single element, and the code below ends up appending '.jpg' to the original URL.

string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
jpg_removed = string.split('.jpg')[0]+'.jpg'

Example

string = 'www.google.com'
com_removed = string.split('.com')[0] 
# com_removed = 'www.google'
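
If the marker might be missing, str.partition is one way to avoid appending it by mistake. The following is only a sketch of that idea; the helper name truncate_after is made up for illustration and is not part of the answer above.

```python
def truncate_after(text, marker):
    """Return text up to and including the first marker; unchanged if marker is absent."""
    head, sep, _tail = text.partition(marker)
    # sep is '' when the marker was not found, so head + sep is the original text
    return head + sep


url = ('https://s13emagst.akamaized.net/products/29146/29145166/images/'
       'res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC')
print(truncate_after(url, '.jpg'))
print(truncate_after('www.google.com', '.jpg'))  # no marker, so printed unchanged
```

Unlike split('.jpg')[0] + '.jpg', this never adds an extension that was not there to begin with.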

EXODIA · Answer 2 · 2021-03-13T09:24:34.050

You can use a regular expression. You just want to ignore the characters after .jpg, so you can use something like this:

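import re
old_url = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
# raw string so that \. reaches the regex engine unchanged
new_url = re.findall(r"(.*\.jpg).*", old_url)[0]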

(.*\.jpg) is a capturing group that matches any number of characters up to and including .jpg. Since . has a special meaning in a regular expression, you need to escape the . in .jpg as \. The trailing .* matches any remaining characters, but since it is outside the capturing group (), it is matched without being extracted.
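
Two details of the pattern above are worth knowing: findall(...)[0] raises IndexError when the URL contains no .jpg, and the greedy .* keeps everything up to the last .jpg rather than the first. A possible variant, offered as a sketch with a hypothetical helper name keep_through_jpg, uses re.search with a non-greedy match and a fallback:

```python
import re


def keep_through_jpg(url):
    # Non-greedy .*? stops at the first ".jpg"; a greedy .* would run to the last one.
    match = re.search(r".*?\.jpg", url)
    # re.search returns None when nothing matches, so fall back to the input.
    return match.group(0) if match else url


print(keep_through_jpg("a.jpg123.jpg456"))
print(keep_through_jpg("no-extension-here"))
```

Whether the first or the last .jpg is the right cut point depends on the data; for the URL in the question there is only one, so both behave the same.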

score 0 · Answer 3 · answered Mar 12 '21 at 17:06

You can use the .find method to locate .jpg, then slice the string to keep everything up to and including it. Example:

string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
index = string.find(".jpg")
new_string = string[:index + 4]

You have to add four because that is the length of .jpg, so the extension itself is not deleted too.
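
One caveat: find returns -1 when the substring is missing, so string[:index + 4] would become string[:3] and quietly return the first three characters. A guarded version might look like the sketch below; cut_after is an illustrative name, and len(marker) replaces the hard-coded 4 so that other markers work too.

```python
def cut_after(text, marker):
    index = text.find(marker)
    if index == -1:
        # Marker not present: return the text untouched instead of a wrong slice.
        return text
    # Keep everything up to and including the marker.
    return text[:index + len(marker)]


print(cut_after("photo.jpgXYZ", ".jpg"))
print(cut_after("photo.png", ".jpg"))
```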

score 0 · Answer 4 · answered Mar 12 '21 at 17:08

The find() method returns the lowest index of the substring if it is found in the given string. If it is not found, it returns -1.

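# named url rather than str, to avoid shadowing the built-in str type
url = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
result = url.find('jpg')  # index where 'jpg' starts
print(result)
new_url = url[:result]  # everything before 'jpg'

print(new_url + 'jpg')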

score 0 · Answer 5 · answered Mar 12 '21 at 17:17

See: Extracting extension from filename in Python

Instead of extracting the extension, we extract the filename (everything before the last dot) and add the extension back. If we know it's always .jpg, that's fine!

import os
filename, file_extension = os.path.splitext('/path/to/somefile.jpg_corruptedpath')
result = filename + '.jpg'

Now, outside of the original question, I think there might be something wrong with how you got that piece of information in the first place. There must be a better way to extract that JPEG without messing around with the path. Sadly, I can't help you with that since I am a novice with BeautifulSoup.

score 0 · Answer 6 · answered Mar 12 '21 at 20:56

You could use a regular expression to replace everything after .jpg with an empty string:

import re

url = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'

name = re.sub(r'(?<=\.jpg).*', "", url)

print(name)
# https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg
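
The same lookbehind idea can be generalized from .jpg to any marker string, which is closer to the title of the question. This is a sketch under the assumption that the marker is a non-empty literal string; strip_after is a made-up name, and re.escape is used so that characters such as . in the marker are matched literally.

```python
import re


def strip_after(text, marker):
    # re.escape makes regex metacharacters in the marker (like ".") literal.
    pattern = r"(?<=" + re.escape(marker) + r").*"
    # DOTALL lets .* also swallow newlines; count=1 cuts at the first marker only.
    return re.sub(pattern, "", text, count=1, flags=re.DOTALL)


print(strip_after("res_1.jpg10DCC3DD", ".jpg"))
print(strip_after("a+b=c+d", "+b"))
```

If the marker does not occur, the pattern never matches and the text comes back unchanged.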
