Run Scrapy on a set of a hundred-plus URLs

Question

I need to download CPU and GPU data for a set of phones from GSMArena. As step one, I downloaded the URLs of those phones by running Scrapy and deleted the unnecessary items.

The code for that is below.

# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem


class MobileInfoSpider(Spider):
    name = "mobile_info"
    allowed_domains = ["gsmarena.com"]
    start_urls = (
        # 'http://www.gsmarena.com/samsung-phones-f-9-10.php',
        # 'http://www.gsmarena.com/apple-phones-48.php',
        # 'http://www.gsmarena.com/microsoft-phones-64.php',
        # 'http://www.gsmarena.com/nokia-phones-1.php',
        # 'http://www.gsmarena.com/sony-phones-7.php',
        # 'http://www.gsmarena.com/lg-phones-20.php',
        # 'http://www.gsmarena.com/htc-phones-45.php',
        # 'http://www.gsmarena.com/motorola-phones-4.php',
        # 'http://www.gsmarena.com/huawei-phones-58.php',
        # 'http://www.gsmarena.com/lenovo-phones-73.php',
        # 'http://www.gsmarena.com/xiaomi-phones-80.php',
        # 'http://www.gsmarena.com/acer-phones-59.php',
        # 'http://www.gsmarena.com/asus-phones-46.php',
        # 'http://www.gsmarena.com/oppo-phones-82.php',
        # 'http://www.gsmarena.com/blackberry-phones-36.php',
        # 'http://www.gsmarena.com/alcatel-phones-5.php',
        # 'http://www.gsmarena.com/xolo-phones-85.php',
        # 'http://www.gsmarena.com/lava-phones-94.php',
        # 'http://www.gsmarena.com/micromax-phones-66.php',
        # 'http://www.gsmarena.com/spice-phones-68.php',
        'http://www.gsmarena.com/gionee-phones-92.php',
    )

    def parse(self, response):
        hxs = Selector(response)
        phone_listings = hxs.css('.makers')

        # One item per listing block; each field holds a list of values.
        for listing in phone_listings:
            # Create a fresh item per iteration instead of reusing one object.
            phone = gsmArenaDataItem()
            phone['model'] = listing.xpath("ul/li/a/strong/text()").extract()
            phone['link'] = listing.xpath("ul/li/a/@href").extract()
            yield phone

Now I need to run Scrapy on that set of URLs to get the CPU and GPU data. All of that info comes from the CSS selector ".ttl".

Kindly guide me on how to loop Scrapy over the set of URLs and output the data to a single CSV or JSON file. I'm comfortable with creating items and using CSS selectors. I need help with how to loop over those hundred-plus pages.

I have a list of urls like:

www.gsmarena.com/samsung_galaxy_s5_cdma-6338.php
www.gsmarena.com/samsung_galaxy_s5-6033.php
www.gsmarena.com/samsung_galaxy_core_lte_g386w-6846.php
www.gsmarena.com/samsung_galaxy_core_lte-6099.php
www.gsmarena.com/acer_iconia_one_8_b1_820-7217.php
www.gsmarena.com/acer_iconia_tab_a3_a20-7136.php
www.gsmarena.com/microsoft_lumia_640_dual_sim-7082.php
www.gsmarena.com/microsoft_lumia_532_dual_sim-6951.php

These are the links to the phone description pages on GSMArena.

Now I need to download the CPU and GPU info of the 100 models I have.

I extracted the URLs of those 100 models for which the data is required.

The spider I wrote for this is:

    from scrapy.selector import Selector
    from scrapy import Spider
    from gsmarena_data.items import gsmArenaDataItem


    class MobileInfoSpider(Spider):
        name = "cpu_gpu_info"
        allowed_domains = ["gsmarena.com"]
        start_urls = (
            "http://www.gsmarena.com/microsoft_lumia_435_dual_sim-6949.php",
            "http://www.gsmarena.com/microsoft_lumia_435-6942.php",
            "http://www.gsmarena.com/microsoft_lumia_535_dual_sim-6792.php",
            "http://www.gsmarena.com/microsoft_lumia_535-6791.php",
        )

        def parse(self, response):
            hxs = Selector(response)
            cpu_gpu = hxs.css('.ttl')
            # Loop over the selected nodes. The original looped over the
            # undefined name phone_listings, which raises a NameError.
            for node in cpu_gpu:
                phone = gsmArenaDataItem()
                # NOTE: these XPaths are copied from the listing spider and
                # still need to be adapted to the phone page markup.
                phone['cpu'] = node.xpath("ul/li/a/strong/text()").extract()
                phone['gpu'] = node.xpath("ul/li/a/@href").extract()
                yield phone

If somehow I could run on the urls for which I want to extract this data, I could get the required data in a single csv file.
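One detail worth noting about the URL list above: the entries have no "http://" prefix, and Scrapy rejects request URLs that lack a scheme. A minimal sketch of turning such a list into something usable as start_urls might look like this. The function name load_start_urls is hypothetical, and the sketch assumes Python 3 and one URL per line.

```python
from urllib.parse import urlparse


def load_start_urls(lines):
    """Return absolute, de-duplicated URLs from an iterable of raw lines."""
    urls = []
    seen = set()
    for line in lines:
        url = line.strip()
        # Skip blank lines and lines commented out with '#'.
        if not url or url.startswith("#"):
            continue
        # Assumption: scheme-less entries are plain HTTP links.
        if not urlparse(url).scheme:
            url = "http://" + url
        # Keep the first occurrence only, preserving input order.
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


raw_lines = [
    "www.gsmarena.com/samsung_galaxy_s5_cdma-6338.php",
    "www.gsmarena.com/samsung_galaxy_s5-6033.php",
    "",
    "www.gsmarena.com/samsung_galaxy_s5-6033.php",
]
start_urls = load_start_urls(raw_lines)
```

In a spider, the same helper could be fed an open file, for example start_urls = load_start_urls(open("phone_urls.txt")), where phone_urls.txt is a hypothetical file name. Running scrapy crawl cpu_gpu_info -o phones.csv should then collect every yielded item into one CSV file.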

can you show us what you have so far? Where are you getting stuck? — Vincent De Smet, May 31 '15 at 13:51

score 0 · Answer 1 · edited May 23 '17 at 11:58


I think you need information from every vendor. If so, you don't have to put those hundreds of URLs in start_urls. Instead, you can use this link as the start URL, then extract those URLs programmatically in parse() and process whichever ones you want.

This answer will help you to do so.

answered Jun 01 '15 at 05:18 by Jithin; edited May 23 '17 at 11:58 by Community

That's almost everything I needed. Thanks for the wonderful code reference. I don't have 15 reputations hence can't upvote your comment. – ajhavery Jun 05 '15 at 17:33
Can you please help me with the extraction path too? Let's say, on the page "http://www.gsmarena.com/htc_desire_820-6636.php", every piece of data is a table element. How do I extract all table row elements? Any reference would be helpful. Sorry for being a noob. This is my first experience with Scrapy. – ajhavery Jun 05 '15 at 17:36
I explained my progress here: "http://stackoverflow.com/questions/30673602/extract-data-from-a-gsmarena-page-using-scrapy" – ajhavery Jun 05 '15 at 18:42
