
I'm using Scrapy + Splash to crawl webpages and extract data from Google ad banners and other ads, and I'm having difficulty getting Scrapy to follow the XPath into them.

I'm using the Scrapy-Splash API to render the pages so their scripts and images load, and to take screenshots. But it seems Google ad banners are created by JS scripts that then insert their contents into a new HTML document within an iframe in the webpage, like so:

[screenshot: the red area is the iframe container, the blue shows the link I want to extract]

Splash makes sure the code is rendered, so I don't run into the usual problem Scrapy has with scripts, where it reads a script's content instead of its resulting HTML -- but I can't seem to find a way to write the XPath needed to reach the element nodes I want (the ad's href link).

If I inspect the element in the browser and copy its XPath, it simply gives me //*[@id="aw0"], which I feel would work if the iframe's HTML were all there was here, but it returns empty no matter how I write it, and I suspect that's because XPath doesn't elegantly handle HTML documents nested within HTML documents.

The XPath to the iframe that contains the Google ad code is //*[@id="google_ads_iframe_/87824813/hola/blogs/home_0"] (the numbers are constant).
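To illustrate what I'm seeing in the scrapy shell (output paraphrased from my session; the ids are the ones above):

    # the iframe element itself is reachable in the rendered main document:
    response.xpath('//*[@id="google_ads_iframe_/87824813/hola/blogs/home_0"]')
    # -> [<Selector data='<iframe id="google_ads_iframe_...'>]

    # but anything inside the iframe's own document comes back empty:
    response.xpath('//*[@id="aw0"]/@href').extract()
    # -> []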

Is there a way to chain these XPaths together to get Scrapy to follow the trail into the container I need? Or should I be parsing the Splash response object directly in some other way, because I can't rely on response.xpath()/response.css() for this?

ConnorU
  • Did you try to open the request in scrapy shell? See [this SO question](https://stackoverflow.com/questions/35352423/scrapy-shell-and-scrapy-splash) and more specifically the answer of Mikhail Korobov. Using this with `view(response)` should give you a better chance to find the error / test your XPath. – Casper Jun 20 '17 at 20:46
  • Mikhail's answer [on the other SO question you linked to] seems to be a good hint in the right direction; unfortunately I don't understand what he means or how to do what he suggests. I tried running the request in the scrapy shell, but I can see from the response that the iframe's contents are not being parsed at all -- which I understand is normal, since scrapy does not render them but splash does. I will try to see if I can do this in a splash shell instead. – ConnorU Jun 21 '17 at 20:15

2 Answers


The problem is that the iframe content is not returned as part of the HTML. You can either try to fetch the iframe content directly (by its src), or use the render.json endpoint with the iframes=1 option:

    import parsel
    from scrapy_splash import SplashRequest

    # ... in your spider, request the page via the render.json endpoint:
    yield SplashRequest(url, self.parse_result, endpoint='render.json',
                        args={'html': 1, 'iframes': 1})

    def parse_result(self, response):
        # render.json returns a JSON payload; each iframe's HTML is under 'childFrames'
        iframe_html = response.data['childFrames'][0]['html']
        sel = parsel.Selector(iframe_html)
        item = {
            'my_field': sel.xpath(...),  # placeholder: XPath relative to the iframe document
            # ...
        }

The /execute endpoint doesn't support fetching iframe content as of Splash 2.3.3.
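If the page has several frames, you may need to search response.data['childFrames'] instead of hardcoding an index. A minimal sketch (assuming each frame dict carries 'requestedUrl', 'html' and nested 'childFrames' keys, as render.json returns for the main frame; the 'googleads' substring check is just an example filter):

    def parse_result(self, response):
        # frames can nest, so walk the childFrames tree depth-first
        def iter_frames(frames):
            for frame in frames:
                yield frame
                yield from iter_frames(frame.get('childFrames', []))

        for frame in iter_frames(response.data.get('childFrames', [])):
            # match the frame by the URL it was loaded from
            if 'googleads' in frame.get('requestedUrl', ''):
                sel = parsel.Selector(frame['html'])
                # ... extract fields relative to the iframe document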

Mikhail Korobov
  • I'm getting an error on "sel = parsel.Selector (...)" is this a library I need to import? – ConnorU Jun 21 '17 at 20:04
  • Yes, you need to import it (`import parsel`). Also note that the XPath shouldn't include anything outside the iframe; process the iframe content as a separate document. – Mikhail Korobov Jun 21 '17 at 20:35
  • Also, you may need to take a different iframe; I wrote `['childFrames'][0]` as an example - index could be different. – Mikhail Korobov Jun 21 '17 at 20:35
  • I don't understand - do you mean the number index you took ([0]) could be different or ['childFrames'] could be called differently? – ConnorU Jun 21 '17 at 20:48
  • If the web page has multiple iframes, and the iframe you're interested in is not the first, you'll have to use the appropriate index instead of 0. – Mikhail Korobov Jun 21 '17 at 20:52
  • Ah, got it. Indeed it does, so I'll have to figure out how to filter them until I find the one I want. I'm gonna work on that so I can confirm this but it looks like this is the answer I needed. I'll accept it as soon as I finish making sure, thanks! – ConnorU Jun 21 '17 at 20:57

An alternative way to deal with the iframe can be the following (where `response` is the main page):

    urls = response.css('iframe::attr(src)').extract()
    for url in urls:
        # follow each iframe src as if it were a normal page
        # (parse_iframe is whatever callback handles the iframe content)
        yield scrapy.Request(response.urljoin(url), callback=self.parse_iframe)

This way the iframe is parsed as if it were a normal page. But at the moment I cannot send the cookies set on the main page to the HTML inside the iframe, and that's a problem.
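A possible workaround, which I haven't verified (parse_iframe is the hypothetical callback from the snippet above, and the cookie parsing is deliberately simplified to name=value pairs), is to read the Set-Cookie headers from the main response and pass them to the iframe requests explicitly:

    import scrapy

    def parse(self, response):
        # collect the cookies the main page set (simplified: ignores path/domain attributes)
        cookies = {}
        for header in response.headers.getlist('Set-Cookie'):
            name, _, value = header.decode('utf-8').split(';', 1)[0].partition('=')
            cookies[name] = value
        for url in response.css('iframe::attr(src)').extract():
            # forward the main page's cookies to each iframe request
            yield scrapy.Request(response.urljoin(url),
                                 callback=self.parse_iframe, cookies=cookies)

Note that if an iframe is served from a different domain (as ad iframes usually are), the main page's cookies wouldn't be valid there anyway.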

chairam