
I want to implement a small search engine in Java, with a NoSQL database and XML, and install it on a site to search that site. I have some questions about it:

1. Is this really a good idea?

2. The most important question: where is the NoSQL database used? I mean, in this project the search engine takes a word from the user, searches for where that word is used, and returns those phrases to the user. So what is the role of the database here?

3. What is the role of XML?

4. What is the best search method for it?

5. I have read in these two links (first link and second one) that I should use Lucene or Solr. Can these two be used in this project? How, and where?

6. What is the best NoSQL database to use for it?

7. Is this a hard project?

I will be really, really thankful for your help.

Nikoo Nam
    Welcome to StackOverflow. Your post is very broad and many of your questions do not have answers that are not just opinions (1, 4, 6 & 7 for sure). I would recommend doing some more research and then asking more specific questions on SO. – Ade Miller Sep 15 '13 at 13:57
  • so please answer numbers 2 and 3 and 5. – Nikoo Nam Sep 15 '13 at 14:26
  • 1
    2, 3 and 5 are also potentially very broad. You're asking for a design discussion, whereas Stack Overflow is meant to provide specific answers to specific technical issues. If you run into an error you can't resolve once you've implemented a noSQL approach, for example, then you would post a request for help, along with a self-contained code example and any error output, and you would hopefully receive an answer. – MarsAtomic Sep 15 '13 at 14:41

1 Answer

I will try to give you my opinions, and I will be glad to have constructive feedback in the comments.

First of all, you are tackling a very soft topic, and you may not like my point of view. The following points are numbered to match your questions.

1) Yes and no. Yes, because you can do a smart search over the keywords stored in your HTML code, but you don't know how many pages you would have to explore. Furthermore, your content may change dynamically, so those keywords could become useless. That last part introduces the "no": you need a way to know the content of the pages, the way questions here on Stack Overflow are marked with tags. I guess those tags are stored somewhere.

2) You take a word from the user, and you would have to run a "web spider" over your own website to find where that word occurs. It takes time to open all the pages you have, search them, and filter the results; even with good code (something like a map-reduce algorithm) you can only parse a page in a few seconds. EDIT: the point is this: you don't know what kind of string or input (call it X from now on) the user will type. Given that, you store it somewhere and start your search:

You write a script that checks all the pages on your website. This is a pretty bad idea. Keep the Stack Overflow example in mind: how can you know exactly how many pages you have? Do you have a fixed number of static pages, or does your content change dynamically (like the text and the number of pages on Stack Overflow)? Either way, you have to run an algorithm that opens every page and looks at its content. You can restrict the search to a particular type of content, for instance by using the keyword meta tag of each HTML page to constrain your search. If X is among the keywords, you are done for a single page, but you still have to repeat the search until you have checked every page. That is a waste of both time and memory. Assuming constant time to open a socket to each page, with n pages, m keywords per page, and l words in X, this takes roughly O(n*m*l) (without even considering that you may want to analyze the whole page text).
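The brute-force loop above can be sketched as follows. This is a minimal illustration, assuming the spider has already fetched each page's keyword list into a map (in reality the pages would be fetched over HTTP, which dominates the cost):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Naive scan: for every page, check every keyword against every word of the
// query X. With n pages, m keywords per page and l query words this is the
// O(n*m*l) loop described above.
public class NaiveKeywordScan {

    // True if any keyword of one page matches any word of the query.
    public static boolean pageMatches(List<String> keywords, List<String> queryWords) {
        for (String keyword : keywords) {           // m iterations
            for (String word : queryWords) {        // l iterations
                if (keyword.equalsIgnoreCase(word)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Returns the URLs of all pages whose keyword list contains any query word.
    public static List<String> search(Map<String, List<String>> pages,
                                      List<String> queryWords) {
        return pages.entrySet().stream()            // n iterations
                .filter(e -> pageMatches(e.getValue(), queryWords))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

The nested loops make the cost explicit: every extra page multiplies the work, which is exactly why this does not scale.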

If you have a lot of resources, you can write this algorithm using a map-reduce model (map-reduce is explained quite well here).
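As a toy illustration of the map-reduce idea on a single machine, Java's parallel streams can play both roles: the map step counts the query word in one page, the reduce step sums the per-page counts. A real map-reduce job would distribute the pages across machines, but the shape is the same:

```java
import java.util.Arrays;
import java.util.List;

// Toy "map-reduce" over page texts: map = count occurrences of the query
// word in one page, reduce = sum the per-page counts, run in parallel.
public class WordCountMapReduce {

    // Map step: occurrences of `word` in a single page, case-insensitive.
    static long countInPage(String pageText, String word) {
        return Arrays.stream(pageText.toLowerCase().split("\\W+"))
                .filter(w -> w.equals(word.toLowerCase()))
                .count();
    }

    // Reduce step: total occurrences across all pages, computed in parallel.
    public static long totalOccurrences(List<String> pages, String word) {
        return pages.parallelStream()
                .mapToLong(p -> countInPage(p, word))
                .sum();
    }
}
```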

Instead, if you use something like a tag system, simply mapping tags to pages and saving them in a simple table (in the simplest case three columns: ID, TAG, PAGE), you can run a fast search on your database by looking in the tag column for X. That seems much faster.

3) This question doesn't ring any bell for me. Instead, let me ask: what will you do with the XML? Do you want to store it somewhere? Are your pages in XML? Do you want to save the search results as XML?

4) I think that Google already provides something like that. Anyhow, a good way to do it is to open each page, read the XML/HTML depending on the page, and run a regular expression to match your word.
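Matching the user's word in a page's text with a regular expression might look like this sketch. The `\b` word boundaries keep, say, "cat" from matching inside "category", and `Pattern.quote` escapes any regex metacharacters in the user's input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Case-insensitive whole-word match of a user-supplied word in page text.
public class RegexPageSearch {

    public static boolean contains(String pageText, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                                    Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(pageText);
        return m.find();
    }
}
```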

5) The two links are self-explanatory; in the answers there you will really find what you need.

6) No clue.

7) No, but you should define "hard". It will take you a long time to think it through and find an appropriate design; then you can decide whether Lucene is good for you, whether you want to use SQL, or something else.

LMG