
I'm trying to write a script in Python that downloads the image on this site that is updated every day:

https://apod.nasa.gov/apod/astropix.html

I was trying to follow the top comment from this post: How to extract and download all images from a website using beautifulSoup?

So, this is what my code currently looks like:

import re
import requests
from bs4 import BeautifulSoup

site = 'https://apod.nasa.gov/apod/astropix.html'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]


for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative 
            # if it is provide the base url which also happens 
            # to be the site variable atm. 
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)

However, when I run my program I get this error:

Traceback on line 17
with open(filename.group(1), 'wb') as f:
AttributeError: 'NoneType' object has no attribute 'group'

So it looks like there is some problem with my Regex perhaps?

K. Hall
  • Your image is going to come back corrupt because of how the URL is built: right now the URL is https://apod.nasa.gov/apod/astropix.htmlimage/1807/FermiFinals1200.jpg when it should be https://apod.nasa.gov/apod/image/1807/FermiFinals1200.jpg – Richard Albright Jul 23 '18 at 19:32

2 Answers


The regex group you are looking for is 0, not 1; it contains the image filename. Also, when the image source path is relative, the URL is built incorrectly. I used the built-in urllib module to parse the site URL:

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

site = 'https://apod.nasa.gov/apod/astropix.html'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'([\w_-]+[.](jpg|gif|png))$', url)
    filename = re.sub(r'\d{4,}\.', '.', filename.group(0))

    with open(filename, 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, prepend the scheme and hostname
            # taken from the site variable.
            hostname = urlparse(site).hostname
            scheme = urlparse(site).scheme
            url = '{}://{}/{}'.format(scheme, hostname, url)

        # for the full-resolution image, the trailing digits before the extension need to be stripped
        url = re.sub(r'\d{4,}\.', '.', url)

        print('Fetching image from {} to {}'.format(url, filename))
        response = requests.get(url)
        f.write(response.content)

Outputs:

Fetching image from https://apod.nasa.gov/image/1807/FermiFinals.jpg to FermiFinals.jpg

And the image is saved as FermiFinals.jpg
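As an alternative to assembling the URL by hand, here is a minimal sketch that resolves the img src against the page URL with urllib.parse.urljoin from the standard library. The helper name absolute_url is hypothetical, and the relative path is the one quoted in the comment under the question. Unlike the scheme/hostname formatting above, this keeps the /apod/ directory of the page in the result.

```python
from urllib.parse import urljoin

site = 'https://apod.nasa.gov/apod/astropix.html'


def absolute_url(page_url, src):
    """Resolve an <img> src (relative or absolute) against the page it came from."""
    # urljoin drops the last path segment of page_url ('astropix.html')
    # before appending a relative src, the same way a browser would.
    return urljoin(page_url, src)


# Relative src: resolved against the directory of the page.
print(absolute_url(site, 'image/1807/FermiFinals1200.jpg'))
# https://apod.nasa.gov/apod/image/1807/FermiFinals1200.jpg

# Absolute src: returned unchanged.
print(absolute_url(site, 'https://example.com/a.png'))
# https://example.com/a.png
```

This would also replace the 'http' not in url check, since urljoin leaves absolute URLs alone.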

Andrej Kesely
  • This worked out for me. One more question, is there any way to get a higher resolution of the image? For example, when running this it downloads a 181 KB picture, but if I manually download the image from the site it gives me a 1.48 MB picture. I think the difference occurs when I click the picture and it opens it in a new tab, and then I download it. – K. Hall Jul 23 '18 at 19:44
  • @K.Hall Yes, you need to strip the last four digits from the URL. I updated my answer. – Andrej Kesely Jul 23 '18 at 20:00
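The two regex steps discussed in this answer can be sketched in isolation. This is only an illustration: the helper names image_filename and full_resolution are hypothetical, and the example.com URL is made up. It shows that re.search returns None when a src does not end in one of the listed extensions (which is what produces the AttributeError in the question), and what the digit-stripping substitution does to the sample path. Whether stripping the digits always yields the full-resolution file is an assumption based on this one image.

```python
import re

# Filename at the end of a src, limited to the extensions used in the question.
IMAGE_NAME = re.compile(r'([\w-]+\.(?:jpg|gif|png))$')


def image_filename(src):
    """Return the trailing image filename, or None if src does not match."""
    match = IMAGE_NAME.search(src)
    # Guard against None instead of calling .group() unconditionally.
    return match.group(1) if match else None


def full_resolution(name_or_url):
    """Drop a run of four or more digits that sits directly before a dot."""
    return re.sub(r'\d{4,}\.', '.', name_or_url)


print(image_filename('image/1807/FermiFinals1200.jpg'))   # FermiFinals1200.jpg
print(image_filename('https://example.com/spacer.svg'))   # None
print(full_resolution('image/1807/FermiFinals1200.jpg'))  # image/1807/FermiFinals.jpg
```

Note that the directory name 1807 is left alone because it is followed by a slash, not a dot.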

I think the issue is the site variable. The code ends up appending the image path to site, which is https://apod.nasa.gov/apod/astropix.html. If you simply remove the astropix.html, it works fine. What I have below is just a small modification of your code; copy/paste and ship it!

import re
import requests
from bs4 import BeautifulSoup

site = "https://apod.nasa.gov/apod/astropix.html"
site_path_only = site.replace("astropix.html","")

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, prepend the base path of the site
            # (the site URL without astropix.html).
            url = '{}{}'.format(site_path_only, url)
        response = requests.get(url)
        f.write(response.content)

Note: if it downloads the image but says it's corrupt and the file is only about 1 KB in size, you are probably getting a 404 for some reason. Just open the 'image' in Notepad and read the HTML it's giving back.
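The same check can be done in code rather than in Notepad. Below is a rough sketch, not part of the script above: the helper names is_image_response and download_image are hypothetical, and it assumes the server sends an image/... Content-Type header for real images. It writes the file only when the response looks like an image, so an HTML error page never ends up saved as a .jpg.

```python
def is_image_response(status_code, content_type):
    """Return True only for a 200 response whose Content-Type is image/*."""
    return status_code == 200 and (content_type or '').lower().startswith('image/')


def download_image(url, path):
    """Fetch url and write it to path only if the server returned an image."""
    # Imported here so is_image_response can be used without requests installed.
    import requests

    response = requests.get(url, timeout=30)
    content_type = response.headers.get('Content-Type')
    if not is_image_response(response.status_code, content_type):
        print('Skipping {}: status {}, type {}'.format(
            url, response.status_code, content_type))
        return False
    # Open the file only after the response has been checked.
    with open(path, 'wb') as f:
        f.write(response.content)
    return True
```

Calling download_image(url, filename.group(1)) inside the loop would then report a 404 instead of silently saving the error page.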

sniperd