How to remove suffix from scraped links?

Question

I'm looking for a solution to get full-size images from a website.

With code I recently finished with help from someone on Stack Overflow, I can download the images, but I get a mix of full-size and downsized ones.

What I want is for all downloaded images to be full-sized.

For example, some image filenames have "-625x417.jpg" as a suffix, and some images don't have it.

https://www.bikeexif.com/1968-harley-davidson-shovelhead (has suffix)
https://www.bikeexif.com/harley-panhead-walt-siegl (no suffix)

If the suffix is removed, the URL points to the full-size image.

https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead-625x417.jpg (scraped)
https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead.jpg (full-size image: the same filename with -625x417 removed)

Other resolutions may also appear in the filenames, so suffixes with different sizes need to be removed too.

I guess I need a regular expression to strip '-<3 digits>x<3 digits>' from the URLs collected by the code below.

But I really don't have any idea how to do that.

If you can do that, please help me finish this. Thank you!

images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \
             selector_article.xpath('//div[@id="content"]//img/@data-src').getall()

Full Code:

import requests
import parsel
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

for page in range(1, 310):
    print(f'======= Scraping data from page {page} =======')

    url = f'https://www.bikeexif.com/page/{page}'

    response = requests.get(url, headers=headers)
    selector = parsel.Selector(response.text)

    containers = selector.xpath('//div[@class="container"]/div/article[@class="smallhalf"]')

    for v in containers:

        old_title = v.xpath('.//div[2]/h2/a/text()').get()
        
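        # Skip containers without a title link; otherwise `title` would be
        # undefined (or left over from the previous iteration) below.
        if old_title is None:
            continue
        title = old_title.replace(':', ' -').replace('?', '')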

        title_url = v.xpath('.//div[2]/h2/a/@href').get()
        print(title, title_url)

        os.makedirs( os.path.join('bikeexif', title), exist_ok=True )

        response_article = requests.get(url=title_url, headers=headers)
        selector_article = parsel.Selector(response_article.text)

        # Need to get full-size images only
        # (strip a size suffix such as -625x417 if present; other sizes must be stripped too)
        images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \
                    selector_article.xpath('//div[@id="content"]//img/@data-src').getall()
        print('len(images_url):', len(images_url))

        for img_url in images_url:

            response_image = requests.get(url=img_url, headers=headers)

            filename = img_url.split('/')[-1]
            
            with open( os.path.join('bikeexif', title, filename), 'wb') as f:
                f.write(response_image.content)
                print('Download complete!!:', filename)

Are you familiar with [regular expressions](https://docs.python.org/3/library/re.html)? They take a while to learn, but they are extremely powerful for this type of thing and are part of many programming languages, not just Python. – ramzeek, Mar 05 '22 at 21:09
@ramzeek I don't know much about them. I tried to learn them once, but they felt too hard for me. – 2752353, Mar 05 '22 at 21:13
I don't use them much in Python, so I'll try to get a working example for you, but they are definitely worth learning if you expect to do a lot of text manipulation. – ramzeek, Mar 05 '22 at 21:15
@ramzeek Yeah, every time I run into this kind of situation, I feel like I need to upgrade myself. But I'm not even good at the Python basics. Anyway, thanks for the tips. – 2752353, Mar 05 '22 at 21:16
I have never stopped learning and feel like I'm constantly upgrading. :) – ramzeek, Mar 05 '22 at 21:25

score 0 · Accepted Answer · answered Mar 05 '22 at 21:24

I would go with something like this:

import re

url = 'https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead-625x417.jpg'

# Raw string (r'...') so that \d and \. reach the regex engine unchanged
new_url = re.sub(r'(.*)-\d+x\d+(\.jpg)', r'\1\2', url)
# https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead.jpg

Explanation:

The regular expression has three parts:
(.*) matches any sequence of characters of any length; the parentheses capture it as the first group.
-\d+x\d+ matches a dash, followed by one or more digits, followed by x, followed by one or more digits.
(\.jpg) matches the literal .jpg and captures it as the second group. The backslash is needed because . is a special character in regular expressions (it matches any single character), so escaping it makes it mean a literal dot.

In the second argument to re.sub we have \1\2, which means "whatever was matched by the first set of parentheses in the pattern" followed by "whatever was matched by the second set of parentheses in the pattern".

Finally, the last argument is the string you want to process.
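
To make the two capture groups concrete, here is a small sketch that uses the same pattern as above (written as a raw string) with re.match and prints what each group holds for the URL from the question. It is only an illustration of how the groups line up, not part of the original answer.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
pattern = re.compile(r'(.*)-\d+x\d+(\.jpg)')

url = 'https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead-625x417.jpg'

match = pattern.match(url)

# Group 1: everything before the "-625x417" size suffix.
print(match.group(1))
# Group 2: the file extension, ".jpg".
print(match.group(2))

# r'\1\2' in re.sub glues the two groups back together,
# which drops the size suffix that sat between them.
rebuilt = match.group(1) + match.group(2)
print(rebuilt == pattern.sub(r'\1\2', url))
```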

answered Mar 05 '22 at 21:24

ramzeek


Thanks a lot. I'll give it a try. – 2752353 Mar 05 '22 at 21:30
I've edited my code like this: `# images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \ selector_article.xpath('//div[@id="content"]//img/@data-src').getall()` `new_images_url = re.sub('(.*)-\d+x\d+(\.jpg)', r'\1\2', images_url)` ... and I got: `File "C:\Python39\lib\re.py", line 210, in sub return _compile(pattern, flags).sub(repl, string, count)` `TypeError: expected string or bytes-like object` – 2752353 Mar 05 '22 at 21:39
I guess the first scraped link doesn't have the suffix, so that may have caused the error. Maybe an if/else is needed to skip it? If so, how do I do that? – 2752353 Mar 05 '22 at 21:49
I don't have parsel, so I can't run your code. What is `images_url` when you get that error? – ramzeek Mar 06 '22 at 00:00
Oh, come on. You don't know how to install a library? `pip install parsel`. As I mentioned in the comment above, I added one new line: `new_images_url = re.sub` (your code). – 2752353 Mar 06 '22 at 00:13
@2752353, please try to be polite. This is not a free debugging service, and I did not want to install parsel. I was hoping you would be interested in learning how to figure out how to find the answer yourself. Hopefully you managed. – ramzeek Mar 06 '22 at 05:15
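
The TypeError quoted in the comments is what re.sub raises when its third argument is not a string. In the question's code, images_url is built from two .getall() calls, so it is a list, and the substitution has to be applied to each element separately. The sketch below assumes that; the helper name strip_size_suffix and the sample list are illustrative stand-ins, not part of the original code. No if/else is needed for links without a suffix, because re.sub returns the input unchanged when the pattern does not match. Whether the server actually has a file at every rewritten URL is not checked here.

```python
import re

# Pattern from the accepted answer, written as a raw string.
SIZE_SUFFIX = re.compile(r'(.*)-\d+x\d+(\.jpg)')

def strip_size_suffix(url):
    """Hypothetical helper: drop a -WIDTHxHEIGHT suffix before .jpg, if present."""
    return SIZE_SUFFIX.sub(r'\1\2', url)

# Stand-in for the list that the two .getall() calls would return.
images_url = [
    'https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead-625x417.jpg',
    'https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead.jpg',
]

# Apply the substitution to each string instead of to the list itself.
full_size_urls = [strip_size_suffix(u) for u in images_url]

for u in full_size_urls:
    print(u)
```

In the scraping loop, the same list comprehension could sit right after the images_url assignment, with the download loop iterating over full_size_urls instead.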
