import requests
from bs4 import BeautifulSoup
from lxml import etree
import csv

with open('1_colonia.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)  # skip the header row
    for row in reader:
        url = row[0]
        page = requests.get(url)
        # parse the html with BeautifulSoup
        soup = BeautifulSoup(page.content, 'html.parser')
        # build an lxml tree from the parsed HTML so that XPath queries can be run on it
        dom = etree.HTML(str(soup))
        property = (dom.xpath('//*[@id="header"]/div/div[2]/h1'))
        duration = (dom.xpath('//*[@id="header"]/div/p'))
        price = (dom.xpath('//*[@id="price"]/div/div/span/span[3]'))
        # save the data to a CSV file, adding the url as a column to the CSV file
        with open('2_colonia.csv', 'a', newline='', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile, delimiter=';') 
            writer.writerow([url, property[0].text, duration[0].text,price[0].text])

'1_colonia.csv' contains a list of 815 links to properties for sale. The script works until this traceback appears:

Traceback (most recent call last):
  File "/home/flimflam/Python/colonia/2_colonia.py", line 23, in <module>
    writer.writerow([url, property[0].text, duration[0].text, price[0].text])
IndexError: list index out of range

I am not sure where the problem lies. Can anyone help me out, please? Thanks,

HedgeHog
  • Does this answer your question? [Does "IndexError: list index out of range" when trying to access the N'th item mean that my list has less than N items?](https://stackoverflow.com/questions/1098643/does-indexerror-list-index-out-of-range-when-trying-to-access-the-nth-item-m) – HedgeHog Sep 19 '22 at 18:52
  • Thank you, HedgeHog, for editing the question. And no, the link you sent does not help me. – Johnny FlimFlam Sep 19 '22 at 18:53
  • Check the answer to the question in the link again. Then check your extracted `lists`; there seems to be an empty one. – HedgeHog Sep 19 '22 at 19:18
  • I found the error. The XPath I use for duration is `//*[@id="header"]/div/p`, but on some pages the element sits at `//*[@id="header"]/div/div[3]/p` instead. On those pages the first expression matched nothing, and that caused the IndexError: list index out of range. – Johnny FlimFlam Sep 19 '22 at 22:17

1 Answer


xpath() returns a list for the kind of expression you are using, so in your script property, duration and price are all lists.

Depending on the expression and on the page, that list can hold zero, one or several elements.

So you must check that each list has results before indexing into it. If a list is empty and you access its first element (property[0], for instance), Python raises IndexError: list index out of range.
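To see the behaviour in isolation, here is a minimal sketch that runs three XPath expressions against an inline HTML snippet. The markup is made up for illustration and is not taken from the real pages; only the `//*[@id="header"]/div/p` expression comes from the question.

```python
from lxml import etree

# Hypothetical markup, used only to illustrate the three cases.
html = '<div id="header"><div><h1>Flat</h1></div><ul><li>a</li><li>b</li></ul></div>'
dom = etree.HTML(html)

one = dom.xpath('//*[@id="header"]/div/h1')   # exactly one match
many = dom.xpath('//*[@id="header"]/ul/li')   # several matches
none = dom.xpath('//*[@id="header"]/div/p')   # no match: an empty list

print(len(one), len(many), len(none))

# Indexing the empty list reproduces the error from the traceback.
try:
    none[0]
except IndexError as exc:
    message = str(exc)
    print("IndexError:", message)
```

The last case is what happens on any page whose layout does not match the expression: no error is raised by xpath() itself, only by the later [0].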

A simple way to check that the lists contain data before writing to the CSV file:

with open('2_colonia.csv', 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=';')
    # an empty list is falsy, so this checks that every XPath matched something
    if property and duration and price:
        writer.writerow([url, property[0].text, duration[0].text, price[0].text])
    else:
        # keep the URL so the failing pages can be inspected later
        writer.writerow([url, 'error'])
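If a field is known to live under more than one XPath, as the asker's later comment reports for duration, one possible variation is to try each expression in turn and fall back to a default. The sketch below assumes the expressions select elements (not text() or attributes); the helper name `first_text`, the constant `DURATION_XPATHS` and the sample markup are hypothetical, while the two XPath strings are the ones quoted in the comment.

```python
from lxml import etree


def first_text(dom, xpaths, default=""):
    """Return the stripped text of the first element matched by any XPath, else default."""
    for expr in xpaths:
        nodes = dom.xpath(expr)
        # Skip empty result lists and elements without text.
        if nodes and nodes[0].text:
            return nodes[0].text.strip()
    return default


# The two layouts mentioned in the comments, tried in order.
DURATION_XPATHS = [
    '//*[@id="header"]/div/p',
    '//*[@id="header"]/div/div[3]/p',
]

# Made-up pages: one per layout, plus one where neither expression matches.
page_a = '<div id="header"><div><p>30 days</p></div></div>'
page_b = ('<div id="header"><div><div>x</div><div>y</div>'
          '<div><p>45 days</p></div></div></div>')
page_c = '<div id="header"></div>'

for html in (page_a, page_b, page_c):
    print(first_text(etree.HTML(html), DURATION_XPATHS, default="n/a"))
```

With a helper like this, the row can always be written, and pages that match no layout show the default value instead of stopping the script.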
Lelo