39

I've always been interested in developing a web search engine. What's a good place to start? I've heard of Lucene, but I'm not a big Java guy. Any other good resources or open source projects?

I understand it's a huge undertaking, but that's part of the appeal. I'm not looking to create the next Google, just something I can use to search a subset of sites that I might be interested in.

Aseem
  • 517
  • 1
  • 7
  • 9
  • It depends on your favorite programming languages. Java seems to be out of the question. Do you code in ASP.NET, Perl, Python, PHP, ...? That would be important to know before any adequate answer can be offered :) – Anheledir Sep 21 '08 at 22:13
  • hey! check out [mine](http://code.google.com/p/goomez/)... very simple file searcher based on [lucene.net](http://incubator.apache.org/lucene.net/) – sebagomez Oct 23 '08 at 02:41
  • Have you tried nutch.net, a port of the Java Nutch? – chugh97 Sep 22 '10 at 11:48

9 Answers

58

There are several parts to a search engine. Broadly speaking, in a hopelessly general manner (folks, feel free to edit if you feel you can add better descriptions, links, etc):

  1. The crawler. This is the part that goes through the web, grabs the pages, and stores information about them in some central data store. In addition to the text itself, you will want things like the time you accessed it, etc. The crawler needs to be smart enough to know how often to hit certain domains, to obey the robots.txt convention, etc. (a minimal polite-fetch sketch follows this list).

  2. The parser. This reads the data fetched by the crawler, parses it, saves whatever metadata it needs to, throws away junk, and possibly makes suggestions to the crawler on what to fetch next time around.

  3. The indexer. Reads the stuff the parser parsed, and creates inverted indexes into the terms found on the webpages. It can be as smart as you want it to be -- apply NLP techniques to make indexes of concepts, cross-link things, throw in synonyms, etc. (see the inverted-index sketch below).

  4. The ranking engine. Given a few thousand URLs matching "apple", how do you decide which result is the best? Just the index doesn't give you that information. You need to analyze the text, the linking structure, and whatever other pieces you want to look at, and create some scores. This may be done completely on the fly (that's really hard), or based on some pre-computed notions of "experts" (see PageRank, etc; a toy PageRank iteration is sketched below as well).

  5. The front end. Something needs to receive user queries, hit the central engine, and respond; this something needs to be smart about caching results, possibly mixing in results from other sources, etc. It has its own set of problems.
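To make item 1 a bit more concrete, here is a minimal "polite fetch" sketch in Python using only the standard library. Everything here (the user-agent string, the fixed delay, the lack of a per-host robots.txt cache) is illustrative, not a production crawler:

```python
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "toy-crawler/0.1"   # made-up name for this sketch
last_hit = {}                    # per-host timestamp of the last fetch

def allowed(url):
    """Check robots.txt before fetching (a real crawler would cache this per host)."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def fetch(url, min_delay=5.0):
    """Fetch one page, never hitting the same host more than once per min_delay seconds."""
    host = urlparse(url).netloc
    wait = min_delay - (time.time() - last_hit.get(host, 0))
    if wait > 0:
        time.sleep(wait)
    if not allowed(url):
        return None
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    last_hit[host] = time.time()
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()
```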
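For items 2 and 3, the core data structure is the inverted index: a map from each term to the documents that contain it. A deliberately naive sketch (the names `tokenize`, `build_index`, and `search` are invented here; a real indexer would also store positions and frequencies, stem terms, drop stop words, and write to disk):

```python
import re
from collections import defaultdict

def tokenize(text):
    """Parser step, very naive: lowercase and split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_index(pages):
    """Indexer step: map each term to the set of URLs containing it.

    `pages` is a dict of {url: raw_text}, i.e. whatever the crawler stored.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search(index, query):
    """Return the URLs containing *all* query terms (boolean AND retrieval)."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

pages = {
    "http://example.com/a": "Apple pie recipes and baking tips",
    "http://example.com/b": "Apple releases a new phone",
}
index = build_index(pages)
print(search(index, "apple pie"))   # {'http://example.com/a'}
```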
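And for item 4, the pre-computed "expert" score is typically something like PageRank. A toy power-iteration version over a made-up three-page link graph (every linked-to page must appear as a key; this is nothing like how a production engine computes it):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration.

    `links` maps each page to the list of pages it links out to.
    Returns a score per page; higher roughly means "more of an expert".
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                 # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(links))   # "b" ends up lowest: it has the fewest incoming links
```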

My advice -- choose which of these interests you the most, download Lucene or Xapian or any other open source project out there, pull out the bit that does one of the above tasks, and try to replace it. Hopefully, with something better :-).

Some links that may prove useful:

  • "Agile web-crawler", a paper from Estonia (in English).
  • The Sphinx search engine, an indexing and search API. Designed for large DBs, but modular and open-ended.
  • "Introduction to Information Retrieval", a textbook about IR from Manning et al. Good overview of how the indexes are built, various issues that come up, as well as some discussion of crawling, etc. Free online version (for now)!

SquareCog
  • 19,421
  • 8
  • 49
  • 63
  • Here is my implementation of the ranking engine (elasticsearch) and the front end (angularjs) https://machinelearningblogs.com/2016/12/12/how-to-build-a-search-engine-part-1/ – Vivek Kalyanarangan Jul 09 '17 at 06:47
6

Xapian is another option for you. I've heard it scales better than some implementations of Lucene.
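If you want a quick feel for it, Xapian's Python bindings are quite approachable. Something along these lines, based on its getting-started examples (the index path and sample text are placeholders, and it's worth checking the current docs for the exact API):

```python
# Requires the xapian-bindings package (module name: xapian).
import xapian

# --- indexing ---
db = xapian.WritableDatabase("./toy-index", xapian.DB_CREATE_OR_OPEN)
termgen = xapian.TermGenerator()
termgen.set_stemmer(xapian.Stem("en"))

for text in ["Apple pie recipes", "Apple releases a new phone"]:
    doc = xapian.Document()
    doc.set_data(text)            # what we get back at search time
    termgen.set_document(doc)
    termgen.index_text(text)      # tokenizes, stems, and adds the terms
    db.add_document(doc)
db.commit()

# --- searching ---
qp = xapian.QueryParser()
qp.set_stemmer(xapian.Stem("en"))
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
query = qp.parse_query("apple pie")

enquire = xapian.Enquire(db)
enquire.set_query(query)
for match in enquire.get_mset(0, 10):   # top 10 matches
    print(match.rank + 1, match.document.get_data())
```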

Oli
  • 235,628
  • 64
  • 220
  • 299
6

Check out Nutch; it's written by the same guy who created Lucene (Doug Cutting).

Mauricio Scheffer
  • 98,863
  • 23
  • 192
  • 275
5

It seems to me that the biggest part is the indexing of sites: making bots to scour the internet and parse their contents.

A friend and I were talking about how amazing Google and other search engines have to be under the hood. Millions of results in under half a second? Crazy. I think that they might have preset search results for commonly searched items.
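That last part is something you can play with on a small scale: cache the results of popular queries so repeats never touch the index. A toy sketch, where `run_search` and `INDEX` are just stand-ins for whatever does the real work:

```python
from functools import lru_cache

INDEX = {
    "apple": ("http://example.com/a", "http://example.com/b"),
}

def run_search(normalized_query):
    """Stand-in for the expensive part: hit the index, rank, format results."""
    return INDEX.get(normalized_query, ())

@lru_cache(maxsize=100_000)
def cached_search(normalized_query):
    """Popular queries get answered from memory instead of being recomputed."""
    return run_search(normalized_query)

def search(query):
    """Normalize first so "Apple " and "apple" share one cache entry."""
    return cached_search(query.strip().lower())

print(search("Apple "))   # computed once...
print(search("apple"))    # ...then served straight from the cache
```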

edit: This site looks rather interesting.

Joel
  • 16,474
  • 17
  • 72
  • 93
  • They do -- they put out academic papers on best ways to cache results on a regular basis. Do you just cache the most recent answers? Do you look at query logs and try to predict what you need to cache and precompute it? Fascinating stuff. – SquareCog Sep 21 '08 at 22:35
4

I would start with an existing project, such as the open source search engine from Wikia.

[My understanding is that the Wikia Search project has ended. However, I think getting involved with an existing open-source project is a good way to ease into an undertaking of this size.]

http://re.search.wikia.com/about/get_involved.html

bmb
  • 6,058
  • 2
  • 37
  • 58
1

If you're interested in learning about the theory behind information retrieval and some of the technical details behind implementing search engines, I can recommend the book Managing Gigabytes by Ian Witten, Alistair Moffat and Tim C. Bell. (Disclosure: Alistair Moffat was my university supervisor.) Although it's a bit dated now (the first edition came out in 1994 and the second in 1999 -- what's so hard about managing gigabytes now?), the underlying theory is still sound and it's a great introduction to both indexing and the use of compression in indexing and retrieval systems.
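To give a flavour of the compression side: postings lists (the sorted document IDs stored for each term) shrink dramatically if you store the gaps between IDs and then spend only as many bytes per gap as it needs. A rough variable-byte sketch of that idea (one common scheme among several the book covers):

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte; the high bit marks a number's last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()          # most significant 7-bit group first
        chunk[-1] |= 0x80        # terminator bit on the final byte
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:          # last byte of this number
            numbers.append(n)
            n = 0
    return numbers

def compress_postings(doc_ids):
    """Store the first ID and then the gaps between consecutive sorted IDs."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

def decompress_postings(data):
    ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        ids.append(total)
    return ids

postings = [3, 7, 11, 120, 121, 5000]
blob = compress_postings(postings)
assert decompress_postings(blob) == postings
print(len(blob), "bytes, versus", len(postings) * 4, "as plain 32-bit integers")
```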

TimB
  • 5,714
  • 2
  • 26
  • 30
1

I'm interested in search engines too. I'd recommend both Apache Hadoop MapReduce and Apache Lucene; building the index with a Hadoop cluster is the best way to make it fast.
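To show the shape of the job, here is the map/reduce pattern for building an inverted index, simulated in plain Python. A real Hadoop job would express the same map and reduce steps in Java (or via Hadoop Streaming) and let the framework do the shuffle across machines; every name below is illustrative:

```python
from collections import defaultdict
from itertools import chain

def map_phase(url, text):
    """Mapper: emit (term, url) pairs for one fetched page."""
    for term in set(text.lower().split()):
        yield term, url

def reduce_phase(term, urls):
    """Reducer: collect all the URLs for one term into a postings list."""
    return term, sorted(set(urls))

def run_job(pages):
    """Simulate the shuffle-and-sort locally; Hadoop does this step across the cluster."""
    grouped = defaultdict(list)
    pairs = chain.from_iterable(map_phase(url, text) for url, text in pages.items())
    for term, url in pairs:
        grouped[term].append(url)
    return dict(reduce_phase(term, urls) for term, urls in grouped.items())

pages = {
    "http://example.com/a": "apple pie recipes",
    "http://example.com/b": "apple phone reviews",
}
print(run_job(pages)["apple"])   # ['http://example.com/a', 'http://example.com/b']
```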

klainfo
  • 11
  • 2
0

There are ports of Lucene. Zend has one freely available. Have a look at this quick tutorial: http://devzone.zend.com/node/view/id/91

Oli
  • 235,628
  • 64
  • 220
  • 299
0

Here's a slightly different approach, if you are not so much interested in the programming of it but more interested in the results: consider building it using the Google Custom Search Engine API.

Advantages:

  • Google does all the heavy lifting for you
  • Familiar UI and behavior for your users
  • Can have something up and running in minutes
  • Lots of customization capabilities

Disadvantages:

  • You're not writing code, so no learning opportunity there
  • Everything you want to search must be public & in the Google index already
  • Your result is tied to Google

Tim Farley
  • 11,720
  • 4
  • 29
  • 30
  • Wouldn't exactly call it an API... – Sean Dec 01 '08 at 14:13
  • Why not? Not every API is a set of callable functions. You can host the XML description of your search engine on your own website, and then you aren't even using Google's web interface for this. – Tim Farley Dec 02 '08 at 19:30