You have to decide at some point just how large you want your crawled list to become. Up to a few tens of millions of items, you can probably just store the URLs in a hash map or dictionary, which gives you O(1) lookup.
In any case, with an average URL length of about 80 characters (that was my experience five years ago when I was running a distributed crawler), you're only going to fit about 10 million URLs per gigabyte. So as the list grows you have to start thinking about either compressing the data or expiring entries so URLs can be re-crawled after some amount of time. If you're only adding 100,000 URLs per day, it would take 100 days to accumulate 10 million URLs, and by then the oldest entries are probably stale enough that re-crawling them is fine.
If those are your limitations, then I would suggest a simple dictionary or hash map that's keyed by URL. The value should contain the last crawl date and any other information that you think is pertinent to keep. Limit that data structure to 10 million URLs. It'll probably eat up close to 2 GB of space, what with dictionary overhead and such.
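If it helps, here's a minimal sketch of that structure in Python, assuming a plain in-memory dict; the names (CrawlInfo, seen_urls, should_crawl, mark_crawled) and the 100-day re-crawl window are just illustrative, not from any particular library:

```python
# Minimal sketch: an in-memory "have we crawled this?" map keyed by URL.
# Names and the 100-day window are illustrative assumptions, not a real library.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class CrawlInfo:
    last_crawled: datetime  # when the URL was last fetched
    # add anything else worth keeping: content hash, HTTP status, depth, etc.

seen_urls: dict[str, CrawlInfo] = {}  # keyed by (normalized) URL

def should_crawl(url: str, recrawl_after: timedelta = timedelta(days=100)) -> bool:
    """Crawl if the URL is unseen or its last crawl is older than the re-crawl window."""
    info = seen_urls.get(url)
    return info is None or datetime.now(timezone.utc) - info.last_crawled > recrawl_after

def mark_crawled(url: str) -> None:
    """Record (or refresh) the last-crawl timestamp for a URL."""
    seen_urls[url] = CrawlInfo(last_crawled=datetime.now(timezone.utc))
```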
You will have to prune it periodically. My suggestion would be a timer that runs once per day and cleans out any URLs that were crawled more than X days ago. In this case you'd probably set X to 100, which caps the structure at roughly 100 days times 100,000 URLs per day, or about 10 million entries.
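A sketch of that prune pass, building on the seen_urls dict above (same imports; prune_old_entries is my name for it); in practice you'd run it from a daily timer, scheduled task, or cron job:

```python
def prune_old_entries(max_age_days: int = 100) -> int:
    """Drop URLs whose last crawl is more than max_age_days old; returns how many were removed."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    stale = [url for url, info in seen_urls.items() if info.last_crawled < cutoff]
    for url in stale:
        del seen_urls[url]
    return len(stale)
```

Collecting the stale keys first and deleting afterward avoids mutating the dict while iterating over it.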
If you start talking about high-capacity crawlers that do millions of URLs per day, then you get into much more involved data structures and creative ways to manage the complexity. But from the tone of your question, that's not what you're interested in.