I'm creating a site spider that grabs all the links from a web page as well as that page's HTML source code. It then checks all the links it has found and keeps only the internal ones. Next, it goes to each of those internal pages and repeats the process.
Basically, its job is to crawl all the pages under a specified domain and grab each page's source. The reason for this is that I want to check whether certain keywords appear on any of the pages, and also to list each page's meta information.
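To make that concrete, here is a rough sketch of the crawl loop, written in Python purely for illustration (the `requests`/`BeautifulSoup` usage and the function name are just how I'm picturing it, not fixed):

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url):
    """Breadth-first crawl of all pages under the start URL's domain."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = [start_url]
    pages = {}  # url -> HTML source

    while queue:
        url = queue.pop(0)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that fail to load
        html = resp.text
        pages[url] = html

        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # resolve + drop fragments
            # keep only internal links we haven't already queued
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)

    return pages
```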
I would like to know whether I should run these checks on the HTML during the crawling phase, as each page is fetched, or whether I should save all the HTML (in an array, for example) and run the checks at the very end. Which would be better performance-wise?
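For clarity, this is the kind of per-page check I mean (again just a sketch; `check_page` and the keyword list are hypothetical):

```python
from bs4 import BeautifulSoup

def check_page(url, html, keywords):
    """Report which keywords appear on the page, plus its meta tags."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text().lower()
    hits = [kw for kw in keywords if kw.lower() in text]
    metas = {m.get("name"): m.get("content")
             for m in soup.find_all("meta") if m.get("name")}
    return {"url": url, "keyword_hits": hits, "meta": metas}

# Option A: call check_page(url, html, keywords) right after each fetch
# inside the crawl loop, keeping only the small result dict.
#
# Option B: keep every page's HTML (as in `pages` above) and check after
# crawling has finished:
#
#   results = [check_page(u, h, keywords) for u, h in pages.items()]
```

Either way the check itself is the same; the difference is only when it runs and whether all of the HTML has to be held in memory until the end.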