
I have a script that parses HTML and saves the images to disk. However, it saves them under the wrong filename.

On Windows, the files are saved without a file extension. E.g., an image should be saved as <filename>.jpg or <filename>.gif, but instead it is saved as just <filename>.

Could you help me to see why this script is not saving the extension correctly in the filename?

I'm running Python 2.7.

""" Tumblr downloader
This program will download all the images from a Tumblr blog """


from urllib import urlopen, urlretrieve
import os, sys, re


def download_images(images, path):
  for im in images:
    print(im)
    filename = re.findall("([^/]*).(?:jpg|gif|png)",im)[0]
    filename = os.path.join(path,filename)
    try:
      urlretrieve(im, filename.replace("500","1280"))
    except:
      try:
        urlretrieve(im, filename)
      except:
        print("Failed to download "+im)

def main():

  #Check input arguments
  if len(sys.argv) < 2:
    print("usage: ./tumblr_rip.py url [starting page]")
    sys.exit(1)

  url = sys.argv[1]

  if len(sys.argv) == 3:
    pagenum = int(sys.argv[2])
  else:
    pagenum = 1

  if (check_url(url) == ""):
    print("Error: Malformed url")
    sys.exit(1)

  if (url[-1] != "/"):
    url += "/"  # str has no append(); concatenate instead

  blog_name = url.replace("http://", "")
  blog_name = re.findall("(?:.[^\.]*)", blog_name)[0]
  current_path = os.getcwd()
  path = os.path.join(current_path, blog_name)
  #Create blog directory
  if not os.path.isdir(path):
    os.mkdir(path)

  html_code_old = ""
  while(True):
    #fetch html from url
    print("\nFetching images from page "+str(pagenum)+"\n")
    f = urlopen(url+"page/"+str(pagenum))
    html_code = f.read()
    html_code = str(html_code)
    if(check_end(html_code, html_code_old, pagenum)):
      break

    images = get_images_page(html_code)
    download_images(images, path)

    html_code_old = html_code
    pagenum += 1


  print("Done downloading all images from " + url)


if __name__ == '__main__':
  main()
BBedit

1 Answer

The line

filename = re.findall("([^/]*).(?:jpg|gif|png)",im)[0]

does not do what you think it does. First off, the dot is unescaped, meaning it will match any character, not just a period.

But the bigger problem is that you messed up the groups. When a pattern contains a capturing group, re.findall returns the contents of that group rather than the whole match. Your only capturing group is the first part in parentheses, so you get just the base filename without the extension. The second group, containing the extension, is a separate, noncapturing group. The (?:...) syntax makes a group noncapturing.

The way I fixed it was by putting a group around the entire match and making the existing groups noncapturing.

re.findall("((?:[^/]*)\.(?:jpg|gif|png))",im)[0]
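To see the difference, here is a small sketch comparing the two patterns. The URLs are made up for illustration (they are not from the question), and the snippet should behave the same on Python 2.7 and Python 3.

```python
import re

# Hypothetical image URL, used only for illustration.
url = "http://example.com/media/tumblr_abc123_500.jpg"

# Original pattern: the only capturing group is the base name,
# so findall returns it without the extension.
original = re.findall(r"([^/]*).(?:jpg|gif|png)", url)

# Fixed pattern: one capturing group wraps the whole filename,
# and the dot is escaped so it only matches a literal period.
fixed = re.findall(r"((?:[^/]*)\.(?:jpg|gif|png))", url)

print(original)  # ['tumblr_abc123_500']
print(fixed)     # ['tumblr_abc123_500.jpg']

# The unescaped dot also accepts any character before the extension.
# Hypothetical URL with no real extension:
odd_url = "http://example.com/notes_png"
loose = re.findall(r"([^/]*).(?:jpg|gif|png)", odd_url)
strict = re.findall(r"((?:[^/]*)\.(?:jpg|gif|png))", odd_url)

print(loose)   # ['notes']  (the "_" was matched by the bare dot)
print(strict)  # []
```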

P.S. Another problem is that the pattern is greedy so it can match multiple filenames at once. However, this isn't necessarily invalid, since spaces and periods are allowed in filenames. So if you want to match multiple filenames here, you'll have to figure out what to do yourself. Something like "((?:\w+)\.(?:jpg|gif|png))" would be more intuitive though.
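As a rough illustration of the greediness issue, the sketch below runs both patterns on a made-up string containing two filenames. The input text is an assumption chosen to show the effect, not something taken from the question.

```python
import re

# Hypothetical text containing two filenames separated by a space.
text = "a b.jpg c.png"

# Greedy [^/]* swallows everything up to the last matching extension,
# so both filenames come back as a single match.
greedy = re.findall(r"((?:[^/]*)\.(?:jpg|gif|png))", text)

# \w+ stops at spaces and periods, so each filename is matched separately.
# Note that it also drops the "a " prefix, since \w does not match spaces.
wordy = re.findall(r"((?:\w+)\.(?:jpg|gif|png))", text)

print(greedy)  # ['a b.jpg c.png']
print(wordy)   # ['b.jpg', 'c.png']
```

Which result is right depends on whether spaces and periods are expected inside the filenames you are matching.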

Antimony