
I'm trying to use scrapy to get content rendered only after a javascript: link is clicked. As the links don't appear to follow a systematic numbering scheme, I don't know how to

1 - activate a javascript: link to expand a collapsed panel

2 - activate a (now visible) javascript: link to cause the popup to be rendered so that its content (the abstract) can be scraped

The site https://b-com.mci-group.com/EventProgramme/EHA19.aspx contains links to abstracts that will be presented at a conference I plan to attend. The site's export to PDF is buggy, in that it duplicates a lot of data at PDF generation time. Rather than dealing with the bug, I turned to scrapy only to realize that I'm in over my head. I've read:

Can scrapy be used to scrape dynamic content from websites that are using AJAX?

and

How to scrape coupon code of coupon site (coupon code comes on clicking button)

But I don't think I'm able to connect the dots. I've also seen mentions of Selenium, but I'm not sure I need to resort to it.

I have made little progress and wonder whether I can get a push in the right direction, with the following information in hand:

In order to make the POST request that will expand the collapsed panel (item 1 above), I have traced that the on-page JS javascript:ShowCollapsiblePanel(116114,1695,44,191); results in a POST request to TARGETURLOFWEBSITE/EventSessionAjaxService/GetSessionDetailsHtml with payload:

{"eventSessionID":116114,"eventSessionWebSiteSetupViewID":191}

The values for eventSessionID and eventSessionWebSiteSetupViewID are clearly visible in the javascript:ShowCollapsiblePanel text, as its first and fourth arguments.
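
From reading the scrapy docs, I think the request itself would look roughly like this (an untested sketch: the URL keeps my TARGETURLOFWEBSITE placeholder, and the JSON Content-Type header is my assumption about how the ASP.NET AJAX service expects to be called):

import json
from scrapy.http import Request

# Placeholder endpoint from the traced request; substitute the real host.
url = "TARGETURLOFWEBSITE/EventSessionAjaxService/GetSessionDetailsHtml"

# The two fields come from the first and fourth ShowCollapsiblePanel arguments.
payload = {"eventSessionID": 116114, "eventSessionWebSiteSetupViewID": 191}

request = Request(
    url,
    method="POST",
    body=json.dumps(payload),
    # Assumption: the AJAX service wants a JSON content type.
    headers={"Content-Type": "application/json; charset=UTF-8"},
)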

How do I use scrapy to iterate over all of the links of the form javascript:ShowCollapsiblePanel? I tried SgmlLinkExtractor, but it didn't return any of the javascript:ShowCollapsiblePanel() links; I suspect that they don't meet its criteria for "links".

UPDATE

Making progress: I've found that SgmlLinkExtractor is not the right way to go, and that the much simpler

sel.xpath('//a[contains(@href, "javascript:ShowCollapsiblePanel")]').re(r'(\d+),(\d+),(\d+),(\d+)')

in the scrapy console returns all of the numeric parameters for every javascript:ShowCollapsiblePanel() link (of course, right now they all come back in one flat list, but I'm just messing around in the console).
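
To keep each link's four numbers together instead of in one flat list, it seems I can iterate anchor by anchor and run the regex per selector (still just a console sketch):

# Run the regex against each anchor so the parameters stay grouped per link.
for link in sel.xpath('//a[contains(@href, "javascript:ShowCollapsiblePanel")]'):
    params = link.re(r'ShowCollapsiblePanel\((\d+),(\d+),(\d+),(\d+)\)')
    if params:
        # The first and fourth arguments match the traced POST payload.
        event_session_id, view_id = int(params[0]), int(params[3])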

The next step will be to take the first javascript:ShowCollapsiblePanel() and generate the POST request and analyze the response to see if the response contains what I see when I click the link in the browser.
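
To make the question concrete, this is the kind of spider I have in mind for that step (again a sketch: the spider and callback names are placeholders I made up, and the endpoint is still my placeholder):

import json
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import Spider

class PanelSpider(Spider):
    # Hypothetical spider for illustration.
    name = "eha_panels"
    start_urls = ["https://b-com.mci-group.com/EventProgramme/EHA19.aspx"]
    ajax_url = "TARGETURLOFWEBSITE/EventSessionAjaxService/GetSessionDetailsHtml"

    def parse(self, response):
        sel = Selector(response)
        # One POST per ShowCollapsiblePanel link.
        for link in sel.xpath('//a[contains(@href, "javascript:ShowCollapsiblePanel")]'):
            params = link.re(r'ShowCollapsiblePanel\((\d+),(\d+),(\d+),(\d+)\)')
            if not params:
                continue
            payload = {
                "eventSessionID": int(params[0]),
                "eventSessionWebSiteSetupViewID": int(params[3]),
            }
            yield Request(
                self.ajax_url,
                method="POST",
                body=json.dumps(payload),
                headers={"Content-Type": "application/json; charset=UTF-8"},
                callback=self.parse_panel,
            )

    def parse_panel(self, response):
        # Compare this fragment with what the browser shows after a click;
        # the abstract popup links inside it would be the next POST to trace.
        self.log(response.body[:500])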

  • You need to investigate the page with developer tools, see where the AJAX request goes, and make a request for that content. From a brief investigation of the page it appears that it makes a POST request to get all abstract details, and then another POST to get a specific abstract. You need to imitate this behavior in your spider. First extract the POST params from the pseudo-links, then make the POST request. Try to implement this, and if you get stuck, update your question with the code you used (writing code for this can take a while, so be patient). – Pawel Miech May 24 '14 at 11:44
  • Added more info - I'm still stuck and would appreciate a push in the right direction. – scrampy May 30 '14 at 21:02

1 Answer


I fought with a similar problem, and after much pulling out of hair I got the data set I needed with import.io, which has a visual-type scraper that can run with JavaScript enabled. It did just what I needed, and it's free. There's also a fork of scrapy on GitHub I saw last night that looked just like the import.io scraper; it's called Portia, though I don't know if it'll do what you want: https://codeload.github.com/scrapinghub/portia/zip/master. Good luck.