Recursively Saving a Website

Question

I'm new to both JavaScript and the Firefox Add-on SDK (cfx). I'm trying to write a Firefox extension that saves the complete contents of a URL by recursively crawling it. The program can be divided into the following stages:

1- Saving the contents of a given URL (including images, text, URLs, etc.).
2- Crawling pages (a. extracting the URLs inside the page, b. recursively traversing them).

I'd be thankful if someone could give me some hints (e.g. keywords to study, links to read, or which parts can be done with the cfx SDK and which with plain JavaScript). One more thing: pages should be requested with the current session, as if the user had opened the URL in a tab, because the user may be logged in to their account.

Anything might be helpful, thanks in advance :-)

Are you asking the community to design the extension for you? Or are you asking for resources? That isn't very clear. — unbindall, Feb 05 '15 at 23:45
@DominatorX this is a valid question, please see why I think these down votes are not valid. Re: Vast API — Noitidart, Feb 06 '15 at 00:53
Actually questions like these are extremely valid. I'm a 6k pointer on SO right now and I asked how to iconify windows and another user outlined it for me. I used that as a starting point for an addon. See the topic here and see how the solution provides a brainstorm algorithm I could use in my addon: http://stackoverflow.com/a/24030011/1828637 — Noitidart, Feb 06 '15 at 01:07

score 1 · Answer 1 · edited May 23 '17 at 10:26

This is a valid question. Beginners need help being pointed in the right direction, because the XPCOM/HTML5/other APIs are so huge.

This is how I would do it:

You could use XMLHttpRequest (see: Sending Data to a Server using JavaScript(Firefox Addon)) to fetch the HTML of a page, and then pass the response to a parser like this one (see: How to parse a XML string in a Firefox addon using Add-on SDK). You can then get all the links and images on the page like this:

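// responseFromAjax is the HTML text returned by the XMLHttpRequest
var parser = new DOMParser();
var doc = parser.parseFromString(responseFromAjax, "text/html");

// Collections of elements; read href/src from each one to get the URLs
var URLs = doc.getElementsByTagName('a');
var IMGs = doc.getElementsByTagName('img');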
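The collections above hold elements, and their href/src attributes are often relative paths. A minimal sketch of turning those attribute values into absolute URLs follows. The helper name `resolveLinks` is hypothetical, and it assumes the standard `URL` constructor is available in the environment the code runs in.

```javascript
// Hypothetical helper (not part of any SDK): turn raw href/src attribute
// values into absolute, de-duplicated http(s) URLs, resolved against the
// address of the page they were found on.
function resolveLinks(rawValues, baseUrl) {
  const seen = new Set();
  for (const raw of rawValues) {
    if (!raw) continue; // skip missing or empty attributes
    let url;
    try {
      url = new URL(raw, baseUrl); // resolves relative paths against baseUrl
    } catch (e) {
      continue; // not a parseable URL, ignore it
    }
    // Drop mailto:, javascript: and other non-web schemes.
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    url.hash = ""; // "#section" points into the same document
    seen.add(url.href);
  }
  return Array.from(seen);
}
```

With the `doc` parsed above, the input could be collected with something like `Array.from(doc.getElementsByTagName('a'), function (a) { return a.getAttribute('href'); })`. Reading the raw attribute and resolving it yourself is likely safer than relying on `a.href`, since a document created by DOMParser does not necessarily use the fetched page's address as its base.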

If the user had not asked this question, I guarantee we would probably have another case of a dev suffering through string manipulation on an AJAX response text, or worse, maybe regex on it.

To use these XPCOM things from the cfx Add-on SDK, see the comments in the XMLHttpRequest topic I linked. They explain how to import chrome (Cu/Ci/etc).
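For the recursive traversal asked about in the question, one possible outline is a depth-limited crawl that remembers which URLs it has already seen. This is only a sketch: `crawl` and `fetchLinks` are hypothetical names, and `fetchLinks(url)` is assumed to be a callback you write yourself that saves the page at `url` and returns a Promise of the absolute URLs found on it.

```javascript
// Hypothetical crawler: `fetchLinks(url)` is an assumed callback that saves
// the page at `url` and returns a Promise of the absolute URLs found on it.
function crawl(startUrl, fetchLinks, maxDepth) {
  const visited = new Set([startUrl]);

  function visit(url, depth) {
    return fetchLinks(url).then(function (links) {
      if (depth >= maxDepth) return; // do not follow links past the limit
      // Claim unvisited links before recursing so no page is fetched twice.
      const fresh = links.filter(function (link) {
        if (visited.has(link)) return false;
        visited.add(link);
        return true;
      });
      // Visit pages one at a time to keep the load on the server low.
      return fresh.reduce(function (chain, link) {
        return chain.then(function () { return visit(link, depth + 1); });
      }, Promise.resolve());
    });
  }

  // Resolves with every URL that was fetched once the crawl has finished.
  return visit(startUrl, 0).then(function () { return Array.from(visited); });
}
```

The visited set is what stops the crawl from looping when pages link back to each other, and the depth limit keeps it from wandering across the whole site. In this sketch a single failed fetch rejects the whole crawl, so a real extension would probably want to catch errors per page.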
