
The first `print tag` loop prints a large list of several hundred `<a>` tags. The second `print tag` loop prints only four `<a>` tags, none of which are the ones I want.

One of the tags that I want is at the end of `tags`. After printing all several hundred tags, I print the last tag, and that correctly prints the end tag. But when I run another for loop over the same (unchanged) list `tags`, the result is not just different but significantly different.

The phenomenon happens with or without the `print '\n\n\n'`; that line is only there to make the split between the two prints easier for me to see.

What is happening to this list in between the first and second for loop to cause this problem?

(This code is exactly as I have it in my script. Originally I didn't have the lines from the first for loop down to the empty line; I added them to debug why the correct URL is missing from the end result.)

EDIT: The code is below, followed by what all the print statements output (only the last section of the first print within the for loop is shown):

import urllib
from bs4 import BeautifulSoup

startingList = ['http://www.stowefamilylaw.co.uk/']
for url in startingList:
    try:
        html = urllib.urlopen(url)
        soup = BeautifulSoup(html,'lxml')
        tags = soup('a')
        for tag in tags:
            print tag
        print tags[-1]
        print '\n\n\n'

        for tag in tags:
            print tag
            if not tag.get('href', None).startswith('..'):
                continue
    except:
        continue

....

<a class="shiftnav-target" href="http://www.stowefamilylaw.co.uk/faq-category/decrees-orders-forms/" itemprop="url">Decrees, Orders &amp; Forms</a>
<a class="shiftnav-target" href="http://www.stowefamilylaw.co.uk/faq-category/international-divorce/" itemprop="url">International Divorce</a>
<a class="shiftnav-target"><i class="fa fa-chevron-left"></i> Back</a>
<a class="shiftnav-target" href="http://www.stowefamilylaw.co.uk/contact/" itemprop="url"><i class="fa fa-phone"></i> Contact</a>
<a class="shiftnav-target" href="http://www.stowefamilylaw.co.uk/contact/" itemprop="url"><i class="fa fa-phone"></i> Contact</a>




<a href="http://www.stowefamilylaw.co.uk/">Stowe Family Law</a>
<a href="#spu-5086" style="color: #fff"><div class="callbackbutton"><i class="fa fa-phone" style="font-size: 16px"></i> Request Callback </div></a>
<a href="#spu-5084" style="color: #fff"><div class="callbackbutton"><i class="fa fa-envelope-o" style="font-size: 16px"></i> Quick Enquiry </div></a>
<a class="ubermenu-responsive-toggle ubermenu-responsive-toggle-main ubermenu-skin-black-white-2 ubermenu-loc-primary" data-ubermenu-target="ubermenu-main-3-primary"><i class="fa fa-bars"></i>Main Menu</a>
Martijn Pieters
DanielSon
  • Now edited with code that accurately displays the issue in the most minimal way I can. Let me know if there's anything else that needs to be adjusted/emphasised. If this is ok now, please allow others to try to help with this problem as well. Thanks Martijn – DanielSon Jul 03 '16 at 10:48
  • You have a blanket `except`. Remove that or replace it with something that prints the error. I bet it is the `.startswith()` that throws an attribute error when the `.get()` returns `None`. That ends the loop. – Martijn Pieters Jul 03 '16 at 10:57
  • Yes, if I replace the `continue` under `except` with `print "this is an error"`, it prints everything from before plus the error message. – DanielSon Jul 03 '16 at 11:02
  • I just added a try and except to the if statement and that seems to have solved the problem. Thanks for your advice, as well as the explanation about how to construct these types of questions from now on – DanielSon Jul 03 '16 at 11:06
  • See also http://blog.codekills.net/2011/09/29/the-evils-of--except--/ In general the `try` block should be as short as possible, and the `except` as specific as possible. I think making the second argument to `tag.get` a string would be neater and more efficient than wrapping a simple conditional with `try` (which also wraps the code *inside* the `if`, again making the block longer than is ideal). – jonrsharpe Jul 03 '16 at 11:20

1 Answer

You have a blanket `except:`:

try:
    # ...
except:
    continue

so any error in the block is masked and the rest of the loop body is skipped. Never use a blanket `except` handler without re-raising; see Why is "except: pass" a bad programming practice?. At the very least, catch only `Exception` and print the error:

except Exception as e:
    print 'Encountered:', e

Without proper diagnostics all we can do is guess.
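
As a rough illustration of how the error gets masked, here is a minimal sketch that uses plain dictionaries as stand-ins for the BeautifulSoup tags (the sample data and the `visit` helper are hypothetical, not from the question). It is written with Python 3 `print()` syntax rather than the Python 2 `print` statement used in the question. The loop stops at the first entry without an `href`, which would match the question's second loop ending right after a tag that has no `href`:

```python
# Stand-in "tags": plain dicts mimic tag.get(); the second one has no href,
# like the <a class="..."> toggle link in the question's output.
tags = [
    {"href": "../about/"},
    {},
    {"href": "../contact/"},
]


def visit(tags):
    """Return (hrefs collected before any failure, error message or None)."""
    seen = []
    try:
        for tag in tags:
            # .get() returns None for the second dict, so .startswith() raises
            if tag.get("href", None).startswith(".."):
                seen.append(tag["href"])
    except AttributeError as exc:
        # A narrow handler that reports the error instead of hiding it
        return seen, str(exc)
    return seen, None


seen, error = visit(tags)
print(seen)  # ['../about/'] -- the third tag is never reached
print("Encountered:", error)
```

With a bare `except: continue` in place of the narrow handler, the same early exit would happen without any message at all.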

One error you definitely have is an `AttributeError` on this line whenever a tag has no `href` attribute; `.get()` then returns `None`, and `None` has no `startswith` attribute:

if not tag.get('href', None).startswith('..'):

Instead of `None`, have `.get()` return an empty string as the default:

if not tag.get('href', '').startswith('..'):

or better yet, select only a tags with an href attribute:

tags = soup.select('a[href]')
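
A small self-contained sketch of that selector, assuming `bs4` is installed; the inline HTML is made up for illustration, and the built-in `html.parser` is used here in place of `lxml` only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the middle link has no href attribute.
html = """
<a href="../about/">About</a>
<a class="toggle">Main Menu</a>
<a href="http://example.com/contact/">Contact</a>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup("a")               # every <a> tag, with or without href
with_href = soup.select("a[href]")  # only <a> tags that carry an href

# Indexing tag["href"] is safe here because the selector guarantees it exists.
relative = [tag["href"] for tag in with_href if tag["href"].startswith("..")]

print(len(all_links), len(with_href))
print(relative)
```

Filtering at selection time means the loop body never has to deal with a missing attribute, so no default value or `try` block is needed around the `startswith` check.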
Martijn Pieters