I am scraping two sets of data from a website using Beautiful Soup and I want to output them to a CSV file in two columns, side by side. I am using spamwriter.writerow([x, y]) for this, but I think because of some error in my loop structure I am getting the wrong output in my CSV file. Below is the code in question:

import csv
import urllib2
import sys  
from bs4 import BeautifulSoup
page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.html').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('Smartphones_20decv2.0.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')        
    for anchor in soup.findAll('a', {"class": "clickStreamSingleItem"},text=True):
        if anchor.string:
            print unicode(anchor.string).encode('utf8').strip()         

    for anchor1 in soup.findAll('div', {"class": "listGrid-price"}):
        textcontent = u' '.join(anchor1.stripped_strings)
        if textcontent:
            print textcontent
            spamwriter.writerow([unicode(anchor.string).encode('utf8').strip(),textcontent])

Output which I am getting in csv is:

Samsung Focus® 2 (Refurbished) $99.99
Samsung Focus® 2 (Refurbished) $99.99 to $199.99 8 to 16 GB
Samsung Focus® 2 (Refurbished) $0.99
Samsung Focus® 2 (Refurbished) $0.99
Samsung Focus® 2 (Refurbished) $149.99 to $349.99 16 to 64 GB

The problem is I am getting only one device name in column 1, repeated, instead of all of them, while the price is coming through for all devices. Please pardon my ignorance as I am new to programming.

1 Answer


You are using anchor.string instead of anchor1. anchor is the last item from the previous loop, not the item in the current loop.
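A minimal illustration of why this happens (sketched in Python 3 syntax; the same holds in Python 2): a for-loop variable is not scoped to the loop and keeps its final value afterwards.

```python
# The loop variable survives the loop; after the loop finishes it
# still refers to the last item that was iterated over.
names = ["Samsung Focus 2", "HTC One X", "Nokia Lumia 900"]
for anchor in names:
    pass

# Any later use of `anchor` (e.g. inside a second, separate loop)
# will repeat this one value over and over.
print(anchor)  # Nokia Lumia 900
```

This is exactly why the same device name is written next to every price.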

Perhaps using clearer variable names would help avoid confusion here; singleitem and gridprice, perhaps?

It could be that I misunderstood, though, and you want to combine each anchor1 with a corresponding anchor. In that case you'll have to loop over them together, for example using zip():

items = soup.findAll('a', {"class": "clickStreamSingleItem"},text=True)
prices = soup.findAll('div', {"class": "listGrid-price"})
for item, price in zip(items, prices):
    textcontent = u' '.join(price.stripped_strings)
    if textcontent:
        print textcontent
        spamwriter.writerow([unicode(item.string).encode('utf8').strip(),textcontent])

Normally it would be easier to loop over the parent table row instead, then find the cells within that row inside the loop. But the zip() approach should work too, provided the clickStreamSingleItem cells line up one-to-one with the listGrid-price matches.
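To see just the pairing logic in isolation, here is a sketch in Python 3 with hypothetical pre-extracted lists standing in for the two BeautifulSoup result sets (the sample names and prices are made up); zip() pairs the i-th item with the i-th price:

```python
import csv
import io

# Hypothetical data standing in for soup.findAll(...) results.
items = ["Samsung Focus 2", "HTC One X", "Nokia Lumia 900"]
prices = ["$99.99", "$149.99 to $349.99", "$0.99"]

# Write to an in-memory buffer here; a real script would open a file.
buf = io.StringIO()
spamwriter = csv.writer(buf, delimiter=',')
for item, price in zip(items, prices):
    # One row per device: name in column 1, price in column 2.
    spamwriter.writerow([item, price])

print(buf.getvalue())
```

One caveat: zip() stops at the shorter of the two sequences, so if the page has a name without a price (or vice versa) the extra entries are silently dropped and the pairing may shift.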

Martijn Pieters
  • Can you help me in pinpointing the changes to be made in code so that my output is w/o any special characters? Eg. I want this name "Samsung Rugby® Smart" to be as "Samsung Rugby Smart" . –  Dec 20 '12 at 09:23
  • @user1915050: You are seeing the latin-1 interpretation of your UTF-8 character. Open the file with an editor that can read UTF-8 and you'll see that it is the `®` character instead. Its unicode codepoint `\u00AE` is encoded to `\xC2\xAE` in UTF-8, and when your file viewer assumes it's latin-1 text instead it'll show `Â` for the `\xC2` byte. – Martijn Pieters Dec 20 '12 at 09:46
  • Okay, Can you help me in invoking "on click event" for a button in javascript. Detailed problem is explained in this question: http://stackoverflow.com/questions/13967454/issues-with-invoking-on-click-event-on-the-html-page-using-beautiful-soup-in-p –  Dec 20 '12 at 10:36
  • Is there a way to remove these special characters from output csv using some method in python code? –  Dec 21 '12 at 08:07
  • @user1915050: You could encode to `ASCII` with `errors` set to `ignore`; you'll remove all accents too then though: `unicode(item.string).encode('ascii', 'ignore')`; `ignore` tries to encode everything to ASCII and will skip anything that doesn't 'fit'. – Martijn Pieters Dec 21 '12 at 08:50
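Both points from the comments above can be reproduced in a few lines (sketched in Python 3, where strings are unicode by default; the original question uses Python 2):

```python
# The ® sign (U+00AE) becomes two bytes in UTF-8; a viewer that
# assumes latin-1 decodes those two bytes as two characters, 'Â®'.
name = "Samsung Rugby\u00ae Smart"
utf8_bytes = name.encode("utf-8")
print(utf8_bytes.decode("latin-1"))  # Samsung RugbyÂ® Smart

# Encoding to ASCII with errors='ignore' silently drops anything
# that doesn't fit, removing ® (and any accented letters too).
print(name.encode("ascii", "ignore").decode("ascii"))  # Samsung Rugby Smart
```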