I just have a SQL table of all the links I have found, and whether they have been parsed or not.
I then use Simple HTML DOM to parse the oldest added page, although it tends to run out of memory with large pages (500 KB+ of HTML), so I use a regex for some of it*. Every link I find is added to the SQL database as needing parsing, along with the time I found it.
The SQL database prevents the data being lost on an error, and as I have 100,000+ links to parse, I do it over a long period of time.
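A minimal sketch of that setup, using PDO (the table and column names here are just examples, not my exact schema):

    <?php
    // Hypothetical queue table:
    //   CREATE TABLE links (
    //       id       INT AUTO_INCREMENT PRIMARY KEY,
    //       url      VARCHAR(2048) NOT NULL,
    //       parsed   TINYINT(1) NOT NULL DEFAULT 0,
    //       found_at DATETIME NOT NULL
    //   );

    $pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');

    // Grab the oldest link that still needs parsing.
    $stmt = $pdo->query(
        "SELECT id, url FROM links WHERE parsed = 0 ORDER BY found_at ASC LIMIT 1"
    );
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($row !== false) {
        $html = file_get_contents($row['url']);

        $foundUrls = array(); // fill this with the links pulled out of $html (see the regex below)

        $insert = $pdo->prepare(
            "INSERT INTO links (url, parsed, found_at) VALUES (?, 0, NOW())"
        );
        foreach ($foundUrls as $url) {
            $insert->execute(array($url)); // queue every new link as unparsed
        }

        // Mark this page as done so it is never fetched twice.
        $pdo->prepare("UPDATE links SET parsed = 1 WHERE id = ?")
            ->execute(array($row['id']));
    }

Because each page is marked parsed only after its links are stored, a crash just means the same page is picked up again on the next run.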
I am unsure, but have you checked the user agent of file_get_contents()? If the pages aren't yours and you are making thousands of requests, you may want to change the user agent, either by writing your own HTTP downloader or using one from a library (I use the one in the Zend Framework), but cURL etc. work fine. A custom user agent allows an admin looking over the logs to see information about your bot. (I tend to put the reason why I am crawling and a contact address in mine.)
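For example, with file_get_contents() you can set it through a stream context (the bot name and contact address below are just placeholders):

    <?php
    // Send a custom user agent with file_get_contents() via a stream context.
    $context = stream_context_create(array(
        'http' => array(
            'user_agent' => 'MyCrawler/1.0 (crawling for research; contact: me@example.com)',
        ),
    ));
    $html = file_get_contents('http://example.com/', false, $context);

    // The cURL equivalent:
    $ch = curl_init('http://example.com/');
    curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0 (crawling for research; contact: me@example.com)');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);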
*The regex I use is:
'/<a[^>]+href="([^"]+)"[^"]*>/is'
A better solution (from Gumbo) could be:
'/<a\s+(?:[^"'>]+|"[^"]*"|'[^']*')*href=("[^"]+"|'[^']+'|[^<>\s]+)/i'
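Either pattern plugs straight into preg_match_all(); for example, with the first one:

    <?php
    $html = file_get_contents('http://example.com/'); // page to scan

    // $matches[1] ends up holding the captured href values.
    preg_match_all('/<a[^>]+href="([^"]+)"[^"]*>/is', $html, $matches);

    foreach ($matches[1] as $url) {
        echo $url, "\n"; // each of these gets queued in the links table as unparsed
    }

Note that with the second pattern the capture keeps the surrounding quotes, so strip them before storing the URL.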