
I'm about to start a very big project.

How do I create a search engine with the following features?

  1. I give it a URL and it gets all the links available on that page
  2. It reads the robots.txt file to find out what to index and what not to index
  3. It picks up any pages newly added to a site that is already in the database, without recrawling the whole site
  4. It reads the XML sitemaps
  5. It works with keywords (how should I handle them?)

Also, if possible: how should I structure my database?

pavium
Mesaber

2 Answers


The first two items you mentioned are the outline. You can start coding those right now.
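As a rough starting point for those two items, here is a minimal Python sketch using only the standard library. The names `LinkExtractor`, `extract_links`, `load_robots`, the `MyCrawler` user agent and the sample HTML and robots.txt text are all made up for illustration; a real crawler would download both documents over HTTP instead of using inline strings.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Make the link absolute and drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_http = absolute.startswith(("http://", "https://"))
                if is_http and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Returns the unique absolute http(s) links found in html, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def load_robots(robots_txt):
    """Builds a rule set from the text of a robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


# Hypothetical sample data standing in for downloaded content.
SAMPLE_HTML = (
    '<a href="/about">About</a> '
    '<a href="/private/x">X</a> '
    '<a href="#top">Top</a>'
)
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

links = extract_links("http://example.com/", SAMPLE_HTML)
rules = load_robots(SAMPLE_ROBOTS)
allowed = [url for url in links if rules.can_fetch("MyCrawler", url)]
print(allowed)
```

Here `allowed` keeps only the links that the sample robots.txt does not disallow, so `/private/x` is filtered out before it is ever queued.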

The rest are some of the things that took Yahoo, and then Google, many man-years to discover and implement. Start off with what you know, learn from your experiences and mistakes, and start again with revision 2. And so on.

jcomeau_ictx
    Many years ago, when I worked for a firm of patent attorneys, I was approached by a woman who wanted to discuss her *invention*. "What is the invention?" I asked. "It's a method of detecting drugs *electronically*," she said. "How does it work?" I enquired. "Well, I don't know," she said, "I was hoping you could put me in touch with someone who could work out the details..." "That's not really 'invention'," I said. – pavium Jun 11 '11 at 06:07

Items 1 to 4 are your first phase. This is the crawling phase, where you gather all your information. You need to write a crawler that goes from page to page, adding the links it finds to its database. You'll also need to figure out which pages need to be crawled more often and which less often.
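To make the crawling phase and the database question a bit more concrete, here is a minimal sketch of a breadth-first crawl that records pages and links in SQLite. The table names (`pages`, `links`), the `last_crawled` column, the `fetch_links` callback and the `FAKE_SITE` dictionary are assumptions made for illustration, not a recommended production schema. The callback stands in for "download the page and extract its links", so the example runs without network access.

```python
import sqlite3
import time
from collections import deque

SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    last_crawled REAL          -- NULL until the page has been fetched
);
CREATE TABLE IF NOT EXISTS links (
    from_page INTEGER NOT NULL REFERENCES pages(id),
    to_page INTEGER NOT NULL REFERENCES pages(id),
    UNIQUE (from_page, to_page)
);
"""


def page_id(db, url):
    """Returns the id of url, inserting a row for it if it is new."""
    db.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    return db.execute("SELECT id FROM pages WHERE url = ?", (url,)).fetchone()[0]


def crawl(db, seed_url, fetch_links, max_pages=100):
    """Breadth-first crawl; fetch_links(url) returns the URLs linked from url."""
    queue = deque([seed_url])
    crawled = 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        source = page_id(db, url)
        row = db.execute(
            "SELECT last_crawled FROM pages WHERE id = ?", (source,)
        ).fetchone()
        if row[0] is not None:
            continue  # already fetched, so do not crawl it again
        for target_url in fetch_links(url):
            known = db.execute(
                "SELECT 1 FROM pages WHERE url = ?", (target_url,)
            ).fetchone()
            target = page_id(db, target_url)
            db.execute("INSERT OR IGNORE INTO links VALUES (?, ?)", (source, target))
            if known is None:
                queue.append(target_url)  # only queue pages seen for the first time
        db.execute(
            "UPDATE pages SET last_crawled = ? WHERE id = ?", (time.time(), source)
        )
        crawled += 1
    db.commit()
    return crawled


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/"],
}

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
crawled = crawl(db, "http://example.com/", lambda url: FAKE_SITE.get(url, []))
print(crawled, "pages crawled")
```

Storing a `last_crawled` timestamp per page is one simple way to decide later which pages are due for a revisit; how often to revisit is a policy you would have to tune yourself.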

Once you have that sorted, you'll have to look at algorithms for figuring out what a page is actually about. You'd need to break each page down into its components and store what it means. You'd also need a lot of disk space to store the text of the pages.
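One very basic way to start on the keyword side is an inverted index: a mapping from each word to the pages that contain it. The sketch below is only that, plain word counting with a tiny made-up stop-word list, and is nowhere near how a real search engine ranks results. The names `tokenize`, `build_index`, `search` and the sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercases text and splits it into words, dropping stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def build_index(documents):
    """Maps each word to {url: number of occurrences on that page}."""
    index = defaultdict(dict)
    for url, text in documents.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][url] = count
    return index


def search(index, query):
    """Ranks pages by the total count of query words they contain."""
    scores = Counter()
    for word in tokenize(query):
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties broken alphabetically by URL.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _score in ranked]


# Hypothetical page texts, as they might come out of the crawling phase.
DOCS = {
    "http://example.com/cats": "Cats and kittens. Cats sleep a lot.",
    "http://example.com/dogs": "Dogs chase cats in the park.",
}

index = build_index(DOCS)
results = search(index, "cats")
print(results)
```

In a database, the same structure could be stored as a table of (word, page, count) rows, which keeps lookups by keyword cheap.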

Related

How do I make a simple crawler in PHP?

Where do search engines start crawling?

Google-like Search Engine in PHP/mySQL (most basic text matching)

how does spider in a search engine works?

Community
JohnP