
My goal is to scrape data from the PGA website to extract all the golf course locations in the USA. From each of the 907 result pages I aim to scrape the name, address, ownership, phone number, and website.

I have created the script below, but the CSV it produces is wrong: it repeats the data from the first few pages of the website and does not contain the full data from all 907 pages.

How can I fix my script so that it will scrape all 907 pages and produce a CSV with all the golf courses listed on the PGA website?

Below is my script:

import csv
import requests 
from bs4 import BeautifulSoup

for i in range(907):      # Number of pages plus one 
     url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
     r = requests.get(url)
     soup = BeautifulSoup(r.content)
g_data2=soup.find_all("div",{"class":"views-field-nothing"})

courses_list=[]

for item in g_data2:
     try:
          name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
     except:
          name=''
     try:
          address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
     except:
          address1=''
     try:
          address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
     except:
          address2=''
     try:
          website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
     except:
          website=''   
     try:
          Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
     except:
          Phonenumber=''      

     course=[name,address1,address2,website,Phonenumber]
     courses_list.append(course)

     with open ('PGA_Data.csv','a') as file:
          writer=csv.writer(file)
          for row in courses_list:
               writer.writerow(row)
Gonzalo68
  • I don’t understand. You are fetching page contents 907 times, but only processing it the last time? – xrisk Jun 27 '15 at 04:29
  • I am trying to extract data from 907 pages of the PGA website, and I am trying to do it in one process by creating a loop that will go through the website and collect all the data. There are about 907 pages' worth of data I need to collect, but my loop is not working. – Gonzalo68 Jun 27 '15 at 04:32
  • Yes, but every time you call `soup = BeautifulSoup(r.content)` you are losing all the data of the previous page. You need to parse the current webpage, before fetching a new one. – xrisk Jun 27 '15 at 04:34
  • By parse, I mean collect all the information and save it to the csv file (the second part of your script) – xrisk Jun 27 '15 at 04:35
  • That's what I thought I did. How can I go about it without losing the data? Can you help me build the script? – Gonzalo68 Jun 27 '15 at 04:36
  • So how do I go about it? I am still learning Python. Do I use another URL, or is there something missing from my script? – Gonzalo68 Jun 27 '15 at 04:39
  • Yes, you must run soup on the page contents, save the results, and _then_ proceed to the new page. – xrisk Jun 27 '15 at 04:40
  • Can you provide a script for me that shows how to do that? I will greatly appreciate it. I am not sure how to do that.. – Gonzalo68 Jun 27 '15 at 04:43
  • Do I just create two parts of my script one having the first page url and the second with the loop? am I getting that right? – Gonzalo68 Jun 27 '15 at 04:49
  • By the way, does this fetch any data at all? Because I am not getting anything. Are your soup selectors correct? – xrisk Jun 27 '15 at 04:49
  • Sorry, sorry. My mistake. They are working. – xrisk Jun 27 '15 at 04:50
  • It works, but it makes multiple repetitions of the first couple of pages and then goes into the last three pages. My CSV does not have all the data in it. – Gonzalo68 Jun 27 '15 at 04:52

1 Answer


Here is the code that you want. It will first parse the current page before going on to the next one. (There are some blank rows; I hope you can fix those yourself.)

import csv
import requests 
from bs4 import BeautifulSoup


def encode(l):
    # Python 2's csv module cannot handle Unicode, so replace every
    # non-ASCII byte with a space before writing.
    out = []
    for s in l:
        text = s.encode('utf-8')
        # adapted from Martijn Pieters' answer:
        # http://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space/20078869#20078869
        out.append(''.join([c if ord(c) < 128 else ' ' for c in text]))
    return out
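
# For example (illustrative value): encode([u'Caf\xe9 Club']) returns
# ['Caf   Club'] -- the two UTF-8 bytes of \xe9 each become a space.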

courses_list = []
for i in range(907):    # result pages are numbered 0 through 906
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

    for item in g_data2:
        try:
            name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
        except (IndexError, AttributeError):
            name = ''
        try:
            address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
        except (IndexError, AttributeError):
            address1 = ''
        try:
            address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
        except (IndexError, AttributeError):
            address2 = ''
        try:
            website = item.contents[1].find_all("div", {"class": "views-field-website"})[0].text
        except (IndexError, AttributeError):
            website = ''
        try:
            phone_number = item.contents[1].find_all("div", {"class": "views-field-work-phone"})[0].text
        except (IndexError, AttributeError):
            phone_number = ''

        course = [name, address1, address2, website, phone_number]

        courses_list.append(encode(course))


with open('PGA_Data.csv', 'wb') as f:   # 'wb', not 'a': appending would duplicate rows on re-runs,
    writer = csv.writer(f)              # and Python 2's csv module wants a binary-mode file
    for row in courses_list:
        writer.writerow(row)
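
One caveat with this structure (my note, not part of the answer above): all 907 pages are parsed before anything is written, so a dropped connection near the end loses everything accumulated in courses_list. A variation is to keep the file open for the whole run and write each page's rows as soon as they are parsed. A minimal sketch, reusing the encode() function and the selectors from above (the field() helper and FIELDS names are my own, not from the PGA markup):

import csv
import requests
from bs4 import BeautifulSoup

FIELDS = ["views-field-title", "views-field-address",
          "views-field-city-state-zip", "views-field-website",
          "views-field-work-phone"]

def field(item, css_class):
    # Return the text of the first matching div, or '' if it is missing.
    try:
        return item.contents[1].find_all("div", {"class": css_class})[0].text
    except (IndexError, AttributeError):
        return ''

with open('PGA_Data.csv', 'wb') as f:
    writer = csv.writer(f)
    for i in range(907):
        url = ("http://www.pga.com/golf-courses/search?page={}"
               "&searchbox=Course+Name&searchbox_zip=ZIP&distance=50"
               "&price_range=0&course_type=both&has_events=0").format(i)
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        for item in soup.find_all("div", {"class": "views-field-nothing"}):
            writer.writerow(encode([field(item, c) for c in FIELDS]))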

EDIT: After the inevitable problems of Unicode encoding/decoding, I have modified the answer and it will (hopefully) work now. But do read this: http://nedbatchelder.com/text/unipain.html
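
For what it's worth, on Python 3 the encode() workaround is unnecessary: the csv module writes Unicode natively as long as the file is opened in text mode with an explicit encoding. A minimal sketch, assuming Python 3 and the courses_list built above:

import csv

# newline='' is what the csv docs ask for on Python 3;
# utf-8 keeps accented course names intact instead of blanking them out.
with open('PGA_Data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for row in courses_list:
        writer.writerow(row)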

xrisk
  • I get this error. How do I fix it? /Final_PGA2.py", line 44, in writer.writerow(row) UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 35: ordinal not in range(128) – Gonzalo68 Jun 27 '15 at 15:47
  • @Gonzalo68 Yes, it is a problem with the csv writer; it cannot handle Unicode properly. I am modifying my answer. Check it out. – xrisk Jun 28 '15 at 02:20