Scraping different image every day from url

Question

I'm trying to write a script in Python that downloads the image on this site that is updated every day:

https://apod.nasa.gov/apod/astropix.html

I was trying to follow the top comment from this post: How to extract and download all images from a website using beautifulSoup?

So, this is what my code currently looks like:

import re
import requests
from bs4 import BeautifulSoup

site = 'https://apod.nasa.gov/apod/astropix.html'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]


for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative 
            # if it is provide the base url which also happens 
            # to be the site variable atm. 
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)

However, when I run my program I get this error:

Traceback on line 17
with open(filename.group(1), 'wb') as f:
AttributeError: 'NoneType' object has no attribute 'group'

So it looks like there is some problem with my Regex perhaps?

Your image is going to come back corrupt because of how the URL is built: right now the URL is https://apod.nasa.gov/apod/astropix.htmlimage/1807/FermiFinals1200.jpg when it should be https://apod.nasa.gov/apod/image/1807/FermiFinals1200.jpg — Richard Albright, Jul 23 '18 at 19:32
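
The two URLs in the comment above can be reproduced with the standard library. This is a minimal sketch, not code from the thread: it assumes the relative src quoted in the comment and uses urllib.parse.urljoin, which resolves a relative path against the page URL the way a browser would. The example.com address is a made-up placeholder.

```python
from urllib.parse import urljoin

# Page URL from the question; the relative src is the one quoted in the comment.
site = 'https://apod.nasa.gov/apod/astropix.html'
relative_src = 'image/1807/FermiFinals1200.jpg'

# Plain concatenation keeps "astropix.html" in the middle of the path.
concatenated = '{}{}'.format(site, relative_src)

# urljoin drops the last path segment of the base before appending a relative path.
resolved = urljoin(site, relative_src)

# An already-absolute URL is returned unchanged, so no 'http' check is needed.
# (example.com is a placeholder, not a URL from the page.)
absolute = urljoin(site, 'https://example.com/picture.png')

print(concatenated)
print(resolved)
print(absolute)
```

Because urljoin handles both relative and absolute inputs, it could replace the `if 'http' not in url` branch entirely.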

Andrej Kesely · Accepted Answer · 2018-07-23T19:59:55.403

The regex group() you are looking for is 0, not 1; it contains the image path. Also, when the image source path is relative, the URL is built incorrectly. I used the built-in urllib module to parse the site URL:

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

site = 'https://apod.nasa.gov/apod/astropix.html'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'([\w_-]+[.](jpg|gif|png))$', url)
    filename = re.sub(r'\d{4,}\.', '.', filename.group(0))

    with open(filename, 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, rebuild the URL from the scheme
            # and hostname of the site variable
            hostname = urlparse(site).hostname
            scheme = urlparse(site).scheme
            url = '{}://{}/{}'.format(scheme, hostname, url)

        # for the full-resolution image, the trailing digits (four or more) need to be stripped
        url = re.sub(r'\d{4,}\.', '.', url)

        print('Fetching image from {} to {}'.format(url, filename))
        response = requests.get(url)
        f.write(response.content)

Outputs:

Fetching image from https://apod.nasa.gov/image/1807/FermiFinals.jpg to FermiFinals.jpg

And the image is saved as FermiFinals.jpg
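
The AttributeError in the question is raised whenever re.search finds no match, because it then returns None and None has no group() method. A hedged sketch of guarding against that follows; filename_from_src is a hypothetical helper name and the src values are made up for illustration, while the pattern is the one used in the code above.

```python
import re

# Same pattern as in the answer: a file name ending in .jpg, .gif or .png.
IMAGE_NAME = re.compile(r'([\w_-]+[.](jpg|gif|png))$')


def filename_from_src(src):
    """Return the image file name at the end of src, or None if there is none."""
    match = IMAGE_NAME.search(src)
    # re.search returns None when nothing matches; calling .group() on that
    # None is what raises "'NoneType' object has no attribute 'group'".
    if match is None:
        return None
    return match.group(0)


# Hypothetical src values, for illustration only.
for src in ['image/1807/FermiFinals1200.jpg', 'image/1807/clip.mp4', 'logo.JPG']:
    name = filename_from_src(src)
    if name is None:
        print('Skipping {}: no supported image extension'.format(src))
    else:
        print('Would save {} as {}'.format(src, name))
```

The pattern is case-sensitive, so an upper-case extension such as .JPG is skipped here; passing re.IGNORECASE to re.compile would be one way to accept it.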

This worked out for me. One more question: is there any way to get a higher-resolution version of the image? For example, when running this it downloads a 181 KB picture, but if I manually download the image from the site it gives me a 1.48 MB picture. I think the difference occurs when I click the picture, it opens in a new tab, and then I download it. — K. Hall, Jul 23 '18 at 19:44
@K.Hall Yes, you need to strip the last four digits from the URL. I updated my answer. — Andrej Kesely, Jul 23 '18 at 20:00
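
The digit stripping mentioned in this comment can be tried in isolation. The sketch below assumes the naming pattern seen in this thread (a size suffix such as 1200 directly before the extension); that pattern is an observation, not a documented rule of the site. full_resolution_url is a hypothetical helper name, and M31_2018.jpg is a made-up file name.

```python
import re


def full_resolution_url(url):
    """Drop a run of four or more digits that sits directly before a dot."""
    return re.sub(r'\d{4,}\.', '.', url)


preview = 'https://apod.nasa.gov/apod/image/1807/FermiFinals1200.jpg'
full = full_resolution_url(preview)
print(full)

# A URL without such a suffix passes through unchanged.
print(full_resolution_url(full))

# Caveat: a name that legitimately ends in digits is altered as well.
print(full_resolution_url('M31_2018.jpg'))
```

Because of that last case, the substitution is a heuristic; it is worth checking that the rewritten URL actually returns an image before saving it.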

score 1 · Answer 2 · answered Jul 23 '18 at 19:41

I think the issue is the site variable. The code ends up appending the image path to the full page URL, https://apod.nasa.gov/apod/astropix.html. If you simply remove astropix.html from it, it works fine. What I have below is just a small modification of what you have; copy/paste and ship it!

import re
import requests
from bs4 import BeautifulSoup

site = "https://apod.nasa.gov/apod/astropix.html"
site_path_only = site.replace("astropix.html","")

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, prepend the base URL, which is
            # the site variable without astropix.html
            url = '{}{}'.format(site_path_only, url)
        response = requests.get(url)
        f.write(response.content)

Note: if it's downloading the image but says it's corrupt and the file is only about 1 KB in size, you are probably getting a 404 for some reason. Just open the 'image' in Notepad and read the HTML it's giving back.
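
Rather than inspecting the saved file by hand, the response can be checked before anything is written. The following is a rough sketch under stated assumptions: is_image_response is a hypothetical helper, and the SimpleNamespace objects are offline stand-ins that mimic only the `ok` and `headers` attributes of a requests.Response.

```python
from types import SimpleNamespace


def is_image_response(response):
    """True if the response succeeded and declares an image content type."""
    content_type = response.headers.get('Content-Type', '')
    return response.ok and content_type.startswith('image/')


# Stand-ins for requests.Response objects, so the sketch runs without network access.
good = SimpleNamespace(ok=True, headers={'Content-Type': 'image/jpeg'})
not_found = SimpleNamespace(ok=False, headers={'Content-Type': 'text/html'})
html_page = SimpleNamespace(ok=True, headers={'Content-Type': 'text/html'})

print(is_image_response(good))
print(is_image_response(not_found))
print(is_image_response(html_page))
```

In the loop above, this would mean calling requests.get(url) first and opening the output file only when is_image_response(response) is true. Calling response.raise_for_status() is another option if an exception on a 404 is preferred.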
