
I would like the script to scrape all items from each page and append them to a CSV file, but there are 2 problems:

1) When I run the script it only goes to a single page (the last page = 64). It doesn't crawl from page 1 to 64.

2) When the script writes data to the CSV file it doesn't append new lines; it re-writes the whole CSV file.

import csv
# YouTube Video: https://www.youtube.com/watch?v=zjo9yFHoUl8
from selenium import webdriver

MAX_PAGE_NUM = 67
MAX_PAGE_DIG = 1

driver = webdriver.Chrome('/Users/reezalaq/PycharmProjects/untitled2/venv/driver/chromedriver')

with open('result.csv', 'w') as f:
    f.write("Product Name, Sale Price, Discount, Old Price \n")

for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)

url = "https://www.blibli.com/jual/batik-pria?s=batik+pria&c=BA-1000013&i=" + page_num


driver.get(url)


buyers = driver.find_elements_by_xpath("//div[@class='product-title']")
prices = driver.find_elements_by_xpath("//span[@class='new-price-text']")
discount = driver.find_elements_by_xpath("//div[@class='discount']")
oldprice = driver.find_elements_by_xpath("//span[@class='old-price-text']")


num_page_items = len(buyers)
with open('result.csv', 'a') as f:
    for c in range(num_page_items):
        f.write(buyers[c].text + ' , ' + prices[c].text + ' , ' + discount[c].text + ' , ' + oldprice[c].text + '\n')


driver.close()
Reezal AQ
2 Answers


If you want to append new lines to the file, you must open it with the "a" mode instead of "w".

with open('result.csv', 'a') as f:
    f.write("Product Name, Sale Price, Discount, Old Price \n")

Definition of the "w" option:

Opens a file for writing only. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing.

Definition of the "a" option:

Opens a file for appending. The file pointer is at the end of the file if the file exists. That is, the file is in the append mode. If the file does not exist, it creates a new file for writing.

Definition of the "ab" option:

Opens a file for appending in binary format. The file pointer is at the end of the file if the file exists. That is, the file is in the append mode. If the file does not exist, it creates a new file for writing.

Therefore, to append new lines, you must open the file with a mode that contains "a" (the append option).

The definitions above are quoted from another Stack Overflow answer.
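
A quick sketch of the difference, using a throwaway demo.txt file:

# Minimal demonstration of "w" vs "a":
with open('demo.txt', 'w') as f:
    f.write("first run\n")    # "w" truncates: the file now contains only this line

with open('demo.txt', 'a') as f:
    f.write("second run\n")   # "a" appends: the file now contains both lines

with open('demo.txt', 'w') as f:
    f.write("third run\n")    # "w" again: the previous contents are gone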

ShellRox

The main issue you had is an indentation problem: everything after the first statement of the for loop sits outside the loop, so the scraping code runs only once, with the URL built from the last page number.
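
In other words, only the first statement of the loop body is indented in the original script:

for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)   # inside the loop

url = "https://www.blibli.com/jual/batik-pria?s=batik+pria&c=BA-1000013&i=" + page_num   # outside the loop
driver.get(url)   # runs once, with the last value of page_num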

Another issue I saw is that you were collecting all the titles together, all the old prices together, and so on, in separate lists.

For this reason it is difficult to tell which price belongs to which item when, for example, an item has missing data.
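
For instance, a sketch with made-up values of how index-based pairing goes wrong:

# Suppose a page lists 3 products and the middle one has no discount element.
# Selenium's find_elements_by_xpath then returns lists of unequal length:
titles    = ["Batik A", "Batik B", "Batik C"]   # len == 3
discounts = ["10%", "25%"]                      # len == 2 (Batik B had none)

for c in range(len(discounts)):
    print(titles[c], discounts[c])
# Batik A 10%
# Batik B 25%   <- wrong: this discount actually belongs to Batik C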

To solve this issue I've put all the items on a single webpage into the variable "products" and extracted each field per item.

About the "append" vs. "write" option for the CSV: in my implementation the first thing I check is whether the result.csv file exists.

Then we have two cases:

  1. result.csv doesn't exist: I create it and write the header in
  2. result.csv already exists: the header is already in place and I can simply append new rows while looping (a minimal sketch of this guard follows the list)
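
A minimal sketch of that guard, equivalent to the check in the full script below:

import os.path

# Write the header only when result.csv does not exist yet, so repeated
# runs keep appending rows under a single header line.
if not os.path.isfile('result.csv'):
    with open('result.csv', 'w') as f:
        f.write("Product Name, Sale Price, Discount, Old Price, Link \n")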

In order to get the data out easily I've used BeautifulSoup (install it with pip).

There are still several challenges ahead, because the data on this webpage is not consistent, but the following example should be enough to get you going.

Please keep in mind that the "break" in the code will stop the scraping at the 1st page.

import csv
# YouTube Video: https://www.youtube.com/watch?v=zjo9yFHoUl8
from selenium import webdriver
from bs4 import BeautifulSoup
import os.path

MAX_PAGE_NUM = 67
MAX_PAGE_DIG = 1

driver = webdriver.Chrome('/Users/reezalaq/PycharmProjects/untitled2/venv/driver/chromedriver')
#driver = webdriver.Chrome()

def write_csv_header():
    # Header has five columns, matching the rows written by write_csv_row.
    with open('result.csv', 'w') as f:
        f.write("Product Name, Sale Price, Discount, Old Price, Link \n")

def write_csv_row(product_title, product_new_price, product_discount, product_old_price, product_link):
    # Note: plain string concatenation breaks the columns if a field itself
    # contains a comma; see the csv.writer sketch after the script.
    with open('result.csv', 'a') as f:
        f.write(product_title + ' , ' + product_new_price + ' , ' + product_discount + ' , ' + product_old_price + ' , ' + product_link + '\n')

# Write the header only if result.csv does not exist yet; otherwise the
# header is already in place and we just append rows below.
if not os.path.isfile('result.csv'):
    write_csv_header()

for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
    url = "https://www.blibli.com/jual/batik-pria?s=batik+pria&c=BA-1000013&i=" + page_num
    driver.get(url)
    source = driver.page_source
    soup = BeautifulSoup(source, 'html.parser')
    products = soup.find_all("a", {"class": "single-product"})   # one <a> per product card
    for product in products:
        # find() returns None when an element is missing, so .text raises
        # AttributeError; fall back to a placeholder in that case.
        try:
            product_title = product.find("div", {"class": "product-title"}).text.strip()
        except AttributeError:
            product_title = "Not available"
        try:
            product_new_price = product.find("span", {"class": "new-price-text"}).text.strip()
        except AttributeError:
            product_new_price = "Not available"
        try:
            product_old_price = product.find("span", {"class": "old-price-text"}).text.strip()
        except AttributeError:
            product_old_price = "Not available"
        try:
            product_discount = product.find("div", {"class": "discount"}).text.strip()
        except AttributeError:
            product_discount = "Not available"
        try:
            product_link = product['href']   # a missing href raises KeyError
        except KeyError:
            product_link = "Not available"
        write_csv_row(product_title, product_new_price, product_discount, product_old_price, product_link)
    break # this stops the parsing at the 1st page. I think it is a good idea to check data and fix all discrepancies before proceeding

driver.close()
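
As a follow-up, the csv module (imported but unused above) handles quoting for you, so commas inside a product name won't break the columns. A sketch with hypothetical row values:

import csv

# Hypothetical values; in the script above they come from BeautifulSoup.
row = ["Batik Slim Fit, Blue", "Rp 150.000", "20%", "Rp 187.500", "product-link"]

with open('result.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(row)   # fields containing commas are quoted automatically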
Pitto