I am a Python beginner. Could someone point out why my script keeps failing with this error?

Traceback (most recent call last):
  File "C:/Python27/practice example/datascraper templates.py", line 21, in <module>
    print findPatTitle[i]
IndexError: list index out of range

Thanks a lot.

Here is the code:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

webpage=urlopen('http://www.voxeu.org/').read()

patFinderTitle=re.compile('<title>(.*)</title>')      ##title tag
patFinderLink=re.compile('<link rel.*href="(.*)"/>')  ##link tag

findPatTitle=re.findall(patFinderTitle,webpage)
findPatLink=re.findall(patFinderLink,webpage)

listIterator=[]
listIterator=range(2,16)

for i in listIterator:
    print findPatTitle[i]
    print findPatLink[i]
    print '/n'
  • Do you have some prior knowledge of the page? Why are you looking for that specific range of results from the two lists? – msturdy May 21 '14 at 21:45
  • The *real* answer here is: [**Don't parse HTML with regexes!**](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Jonathon Reinhart May 21 '14 at 21:51
  • What is the value of i when the error occurs, and how many entries are in the lists? Apart from not using regex for HTML, why have you hard-coded 2 and 16 instead of using the length of the lists? – mmmmmm May 21 '14 at 22:05
  • You're even importing `BeautifulSoup` already! *Use it!* – Jonathon Reinhart May 21 '14 at 22:07
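
As a rough sketch of what the comments above suggest, here is the same extraction done with an HTML parser instead of regexes. The commenters recommend BeautifulSoup; to stay dependency-free, this example swaps in the standard library's `HTMLParser` instead. The names `TitleAndLinkParser` and `sample_page` are made up for illustration, and the sample HTML is not the real page.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class TitleAndLinkParser(HTMLParser):
    """Collects the text of <title> tags and the href of <link> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.titles = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
            self.titles.append('')
        elif tag == 'link':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears inside a <title> element.
        if self._in_title:
            self.titles[-1] += data


# Hypothetical stand-in for the downloaded page.
sample_page = ('<html><head><title>Example page</title>'
               '<link rel="stylesheet" href="/style.css"/>'
               '<link rel="alternate" href="/feed.xml">'
               '</head><body><p>Hello</p></body></html>')

parser = TitleAndLinkParser()
parser.feed(sample_page)
parser.close()
print(parser.titles)
print(parser.links)
```

Because the parser tracks tags rather than matching text, it does not depend on how the attributes are ordered or whether a tag is self-closing.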

2 Answers

The error message is perfectly descriptive.

You're trying to access a hard-coded range of indices (range(2,16), i.e. 2 through 15) in findPatTitle, but you have no idea how many items the list actually contains.
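
One way to see this is to check the length of the list before indexing into it. The sketch below applies the question's title pattern to a small made-up HTML string (`sample_page` is a hypothetical stand-in, not the real page):

```python
import re

# Hypothetical stand-in for the downloaded page; the real content will differ.
sample_page = '<html><head><title>Example</title></head><body></body></html>'

titles = re.findall('<title>(.*)</title>', sample_page)
print(len(titles))  # how many matches were actually found

# Valid indices run from 0 to len(titles) - 1, so titles[2] fails here.
try:
    titles[2]
    index_ok = True
except IndexError:
    index_ok = False
print(index_ok)
```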

When you want to iterate over multiple similar collections simultaneously, use zip().

for title, link in zip(findPatTitle,  findPatLink):
    print 'Title={0} Link={1}'.format(title, link)
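
To illustrate why this avoids the IndexError, here is a small sketch with made-up lists of different lengths (the sample data is hypothetical). zip() stops at the shorter input, and an empty input just produces no iterations:

```python
# Hypothetical sample data; the lists deliberately differ in length.
titles = ['First title', 'Second title', 'Third title']
links = ['http://example.com/1', 'http://example.com/2']

pairs = list(zip(titles, links))
for title, link in pairs:
    print('Title={0} Link={1}'.format(title, link))

# An empty list yields no pairs at all instead of raising IndexError.
empty_pairs = list(zip([], links))
```

Note that the unmatched third title is silently dropped, so compare the lengths first if a mismatch should be treated as an error.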
Jonathon Reinhart
  • there is only one element in `findPatTitle` – Padraic Cunningham May 21 '14 at 21:47
  • @PadraicCunningham In one isolated test case? `re.findall` returns "all non-overlapping matches of pattern in string, as a list of strings." – Jonathon Reinhart May 21 '14 at 21:49
  • I just mean that in his example, when I printed his list, there was actually nothing in it; his regex is finding nothing. – Padraic Cunningham May 21 '14 at 21:54
  • @PadraicCunningham Right, but that's based on some external input (the content of that web page). Who writes programs based on some fixed input? What would the point of the program be? The point of my answer is to show how to do this correctly, regardless of inputs. You know what they say happens when you *assume*. – Jonathon Reinhart May 21 '14 at 22:05
  • Yep, I totally understand. I just mentioned it because I thought it was worth pointing out that using regex was also not a very good idea. – Padraic Cunningham May 21 '14 at 22:08

The problem is that you got a different number of results than you expected. Don't hard-code the index range. But let's also rewrite this to be a bit more Pythonic:

Replace this:

listIterator=[]
listIterator=range(2,16)

for i in listIterator:
    print findPatTitle[i]
    print findPatLink[i]
    print '/n'

with the two lists zipped together:

for title, link in zip(findPatTitle, findPatLink):
    print title
    print link
    print '\n'  # newline escape; '/n' would print the literal characters /n

This loops over both lists at once, however long they are: 1 element or 100 elements, it makes no difference. If the lists differ in length, zip() stops at the end of the shorter one.
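
If the intent of range(2,16) was to skip the first two results and print at most the next fourteen, slicing keeps that intent without risking an IndexError. This is a sketch with hypothetical sample data, not the actual regex results:

```python
# Hypothetical sample data standing in for the regex results.
titles = ['t0', 't1', 't2', 't3']
links = ['l0', 'l1', 'l2', 'l3']

# Slices that run past the end of a list are truncated, not an error.
selected = list(zip(titles[2:16], links[2:16]))
for title, link in selected:
    print(title)
    print(link)

# Slicing an empty list is also safe and simply gives no pairs.
nothing = list(zip([][2:16], [][2:16]))
```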

mhlester
  • Thanks for all the discussion and solutions. I actually have no prior coding experience (except for Matlab). What is an efficient way to learn data scraping? – user3662676 May 23 '14 at 19:03