
I'd like a sanity check on this Python script. My goal is to feed in a list of URLs and get a byte size for each, giving me an indicator of whether the URL is good or bad.

import urllib2
import shutil

urls = (LIST OF URLS)

def getUrl(urls):
    for url in urls:
        file_name = url.replace('https://','').replace('.','_').replace('/','_')
        try:
            response = urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            print e.code
        except urllib2.URLError, e:
            print e.args
        print urls, len(response.read())
        with open(file_name,'wb') as out_file:
            shutil.copyfileobj(response, out_file)
getUrl(urls)

The problem I am having is that my output looks like:

(LIST OF URLS) 22511
(LIST OF URLS) 56472
(LIST OF URLS) 8717
...

How would I make only one url appear with the byte size?
Is there a better way to get these results?

Jon Phillips

2 Answers


Try

print url, len(response.read())

Instead of

print urls, len(response.read())

You are printing the list each time. Just print the current item.
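
In context, the relevant part of the loop would look like this (a sketch based on the question's code, with a continue added so a failed request isn't read):

def getUrl(urls):
    for url in urls:
        try:
            response = urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            print e.code
            continue
        except urllib2.URLError, e:
            print e.args
            continue
        # url is the current item; urls is the whole list
        print url, len(response.read())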

There are some alternative ways to determine a page's size, described here and here; there is little point in duplicating that information here.

Edit

Perhaps you would consider using requests instead of urllib2.

You can easily extract just the Content-Length header from a HEAD request and avoid a full GET, e.g.

import requests

# HEAD returns only the headers, not the body
h = requests.head('http://www.google.com')

print h.headers['content-length']
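
Adapted to a whole list of URLs like the one in the question, that could look something like this (a sketch; not every server sends a Content-Length header, so a fallback is used):

import requests

for url in urls:
    try:
        h = requests.head(url, allow_redirects=True)
    except requests.RequestException as e:
        print url, 'failed:', e
        continue
    # headers is case-insensitive; use a placeholder when the header is absent
    print url, h.headers.get('content-length', '?')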

A HEAD request using urllib2 or httplib2 is detailed here.
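
For reference, a minimal sketch of that approach with urllib2: the Request class has no direct HEAD option, so get_method is overridden (the URL is only a placeholder):

import urllib2

class HeadRequest(urllib2.Request):
    # urllib2 normally issues GET (or POST when data is supplied); force HEAD
    def get_method(self):
        return 'HEAD'

response = urllib2.urlopen(HeadRequest('http://www.google.com'))
# the response headers carry the size without downloading the body
print response.info().getheader('content-length')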

Paul Rooney

How would I make only one url appear with the byte size?

Obviously: don't

print urls, ...

but

print url, ...
Marcus Müller