I am building a simple scraper with Scrapy but am having issues extracting certain parts of the data. The website contains about 20 of the following blocks of code:

 <div class="row result">
    <div class="updateCont date col-md-2 col-sm-2 col-xs-3">
         <span>    
            <strong>Fri. 10 Feb</strong> <br />0:00 AM
         </span>
    </div>
    <div class="updateCont eventIcon col-md-1 col-sm-1 col-xs-3">
        <div class="icon ">
            <i class="fa fa-update"></i>
        </div>
    </div>
    <div class="updateCont event col-md-9 col-sm-8 col-xs-6">
        <span> 
              The buyer has been notified of this update. <br />
              <span class="inner department">
                  124
              </span>
        </span>
    </div>
</div>

I have managed to extract each one of these with:

sel = Selector(text=response.body)
updates =  sel.xpath("//div[@class='row result']")

I would now like to isolate the date and convert it into a datetime object, and also extract the string from the updateCont event div, e.g. "The buyer has been notified of this update."

I tried:

for update in updates:
        date = update.xpath('//span').extract()
        print ( len(date) )

which results in 7. I was expecting it to be 3. More worryingly, if I print out just date, it prints the same data three times. I was expecting three different lots of data, as there are three separate spans in the HTML.

Is

sel = Selector(text=response.body)
updates =  sel.xpath("//div[@class='row result']")

the correct code to isolate the sections? What would be the best approach to extract the spans?

1 Answer

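In [19]: for update in updates:
    ...:         # the leading "." keeps the query inside this update block
    ...:         spans = update.xpath('.//span')
    ...:         for span in spans:
    ...:             # normalize-space() joins all nested text and collapses whitespace
    ...:             text = span.xpath('normalize-space()').extract_first()
    ...:             print(text)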

out:

Fri. 10 Feb 0:00 AM
The buyer has been notified of this update. 124
124

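Use a leading . (.//span rather than //span) to restrict the query to the current node. An XPath that starts with // is absolute, so it searches the whole document no matter which selector it is called on. That is why every iteration of the loop returned the same spans.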

宏杰李
  • Thanks for that, it worked. The only small issue now is that `Fri. 10 Feb
    0:00 AM` only yields the 0:00 AM and not the bit within the strong tag.
    –  Feb 10 '17 at 17:24
  • I am still not getting the bit in the strong tag. Once it works fully I will of course accept the answer. –  Feb 10 '17 at 20:03
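
On the strong tag: a text() query on the span returns only the span's direct text nodes, so it skips anything nested inside strong. normalize-space() on the span itself returns the combined string, as in the answer's output (Fri. 10 Feb 0:00 AM). For the datetime conversion the question asks about, the following is a rough sketch using only the standard library. The function name parse_update_datetime is hypothetical. The year argument is an assumption, because the markup does not include a year. It also assumes English month abbreviations. The time is parsed by hand because a value like 0:00 AM is not a valid 12-hour time for strptime's %I directive.

```python
import re
from datetime import datetime

# Matches e.g. "10 Feb 0:00 AM"; a leading weekday such as "Fri." is ignored.
_PATTERN = re.compile(
    r"(\d{1,2})\s+([A-Za-z]{3})\s+(\d{1,2}):(\d{2})\s*([AP]M)",
    re.IGNORECASE,
)


def parse_update_datetime(text, year):
    """Parse text like 'Fri. 10 Feb 0:00 AM' into a datetime.

    `year` must be supplied by the caller (assumption: the page omits it).
    """
    match = _PATTERN.search(text)
    if match is None:
        raise ValueError(f"unrecognised date text: {text!r}")
    day, month, hour, minute, meridiem = match.groups()
    # %b relies on English month abbreviations in the active locale.
    date = datetime.strptime(f"{day} {month} {year}", "%d %b %Y")
    # Fold 12 to 0 first, so "12:15 AM" and "0:15 AM" both mean 00:15.
    hour = int(hour) % 12
    if meridiem.upper() == "PM":
        hour += 12
    return date.replace(hour=hour, minute=int(minute))


print(parse_update_datetime("Fri. 10 Feb 0:00 AM", 2017))  # 2017-02-10 00:00:00
```

The input string here is what normalize-space() returns for the date span, so the two steps can be chained once the relative query works.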