
I need to save the data from a bunch of popup windows. The popups are triggered with JS, which retrieves the data from the server and displays it in an iframe popup. Clicking each link and copying the data by hand will take forever...

I need a way to scrape the data that I can then sort.

The links:

<a href="javascript:getReport('111')">LINK</a>
<a href="javascript:getReport('112')">LINK2</a>

The JS:

function getReport(ID) {
    var id = ID;
    var modalType = 'user';
    parent.parent.$.fn.colorbox({
        href: ReportUrl + '?sharedID=' + id + '&modalType=' + modalType,
        open: true,
        iframe: true,
        width: 700,
        height: 400,
        title: 'report',
        close: "<button id=\"Close\" onClick=\"javascript:parent.parent.$.fn.colorbox.close()\">Close</button>",
        onClosed: false
    });
}

My thoughts: 1. Is there a way to trigger them all open, copy all the data, then sort it? 2. Is there a way to save each one as an HTML file, which I can again sort through?

Once I have the data accessible locally I can sort it without much of an issue; it's just a matter of how I can get the data. I have looked around but don't see any way of scraping it, since the page I want to scrape isn't at a set URL: you have to navigate JS links that then bring up the HTML page. This is also all behind a login.
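A rough sketch of collecting all the report IDs from the links (assuming jQuery is reachable as `$` on the page that holds the links; `getReport()` suggests it may only live in the top frame, so adjust as needed):

var ids = [];
// Match every link of the form href="javascript:getReport('...')".
$('a[href^="javascript:getReport"]').each(function () {
    var match = $(this).attr('href').match(/getReport\('([^']+)'\)/);
    if (match) {
        ids.push(match[1]); // e.g. "111", "112"
    }
});
console.log(ids);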

If anyone has any suggestions I would be really grateful.

ZombieDood
  • I wouldn't use Javascript or iframes for this. I'd use `PHP` and `cURL` to get the pages. Check out this link http://www.jacobward.co.uk/web-scraping-with-php-curl-part-1/ to get started. – Wesley Smith Aug 10 '16 at 13:59
  • @DelightedD0D The issue is that the site is behind a login and the page where the content is doesn't have a set URL; it's generated via JS. So I can get a page with a few thousand links up with my browser, but I'm not sure it would work with PHP/cURL. I was thinking of manipulating the JS to trigger all the elements or something along those lines. – ZombieDood Aug 10 '16 at 14:19
  • You don't need any of that. PHP and cURL are more than capable of handling any logins you need, and you can pass in any variables you want. You can even build the URL with GET params the same way you are doing it now. Given the article I linked to, you could do `$scraped_website = curl("http://www.example.com?sharedID=1234&modalType=some value");` – Wesley Smith Aug 10 '16 at 14:28
  • If you insist, you can still use all the iframes and stuff for your interface; just write the PHP script separately, then when you want to trigger the scrape, send the generated URL to the PHP script via AJAX and get the results in the success function. – Wesley Smith Aug 10 '16 at 14:36
  • I see what you are saying; that might work. I would still need to get all the IDs though to make that work, wouldn't I? Then feed them into the script. – ZombieDood Aug 10 '16 at 14:58
  • Correct. A simple approach might be to collect all of the IDs and send them as an array. In the PHP script, loop over the array, processing the data for each ID and creating a new array like `[['someId'=>'some processed value'], ...]`. JSON-encode that array and return it back to your AJAX call, then use the processed data on the page (a sketch of this is below). – Wesley Smith Aug 10 '16 at 18:20
  • As far as actually processing the page data, definitely don't try to use regular expressions; have a look at [simplehtmldom](http://simplehtmldom.sourceforge.net/). – Wesley Smith Aug 10 '16 at 18:21
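A minimal browser-side sketch of the approach described in these comments. The endpoint name `scrape.php` and the JSON shape it returns are assumptions; you would write that script yourself along the lines of the linked article:

// Minimal sketch: send the collected IDs to a server-side PHP script and
// let it log in, then fetch and parse each report with cURL/simplehtmldom.
// "scrape.php" is a hypothetical endpoint; it is assumed to return JSON
// like [{"someId": "some processed value"}, ...] as described above.
jQuery.ajax({
    url: 'scrape.php',
    method: 'POST',
    data: { ids: ids }, // ids collected as in the question's sketch
    dataType: 'json',
    success: function (results) {
        jQuery.each(results, function (index, row) {
            console.log(row); // the processed value for one report
        });
    },
    error: function (xhr, status) {
        console.warn('Scrape request failed: ' + status);
    }
});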

1 Answer


If the URLs you're trying to scrape don't live on the same domain as the page containing the "scraper" code, this won't work due to cross-domain security (the browser's same-origin policy).

Otherwise, you can use jQuery/AJAX instead of a popup:

jQuery.ajax({
  url: ReportUrl + '?sharedID=' + id + '&modalType=' + modalType, // id and modalType as in getReport() above
  method: 'GET',
  success: function(res) {
    console.log(res); // res is the response body itself, typically HTML source code
  },
  error: function(xhr, status) {
    console.warn('Request failed: ' + status);
  }
});

Again, this will only work on the same domain.
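For completeness, a sketch that applies this to every report link on the page and collects the results for sorting. `ReportUrl`, the `modalType` value, and the `getReport()` link pattern come from the question; the `.report-data` selector is a placeholder for the real report markup:

// Sketch: fetch every report via AJAX (same-domain only, as noted above)
// and collect the raw data for sorting.
var reports = [];
jQuery('a[href^="javascript:getReport"]').each(function () {
    var match = jQuery(this).attr('href').match(/getReport\('([^']+)'\)/);
    if (!match) { return; }
    var id = match[1];
    jQuery.ajax({
        url: ReportUrl + '?sharedID=' + id + '&modalType=user',
        method: 'GET',
        dataType: 'html',
        success: function (html) {
            // Parse the returned HTML with jQuery rather than regex hacks.
            // '.report-data' is a placeholder selector for the real markup.
            var text = jQuery(html).find('.report-data').text();
            reports.push({ id: id, data: text });
        },
        error: function (xhr, status) {
            console.warn('Report ' + id + ' failed: ' + status);
        }
    });
});
// reports fills in asynchronously; sort it once all requests have completed.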

Jeremy Klukan
  • While this will work in some cases, it is a very inflexible and fragile approach, especially when you take into account all the iframes the OP intends to use. – Wesley Smith Aug 10 '16 at 14:32
  • The OP is currently using iframes to load and display the URLs and wants to scrape their content. My suggestion allows the content to be scraped without the use of iframes, which seem to be used here only to load the content. On the same domain, you can also access the DOM of IFRAME-loaded content, but that would be even more inflexible and fragile, plus it would take more code. – Jeremy Klukan Aug 10 '16 at 15:00
  • I understand, and this *technically* answers the question. However, IMHO, it is a poor approach to solving the problem for many reasons, not the least of which is that it will undoubtedly lead to the OP attempting to [parse the resulting HTML with regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) or simple string manipulation, which is absolutely NOT the way to do this. Server-side scripting has tools made specifically for this, which should be used instead. – Wesley Smith Aug 10 '16 at 18:07