
I'm preparing to build a distributed search module with Lucene and Hadoop, but I'm confused about a few things:

  1. As we know, HDFS is a distributed file system. When I put a file onto HDFS, it is split into several blocks and stored on different slave machines in the cluster. But if I use Lucene to write an index on HDFS and I want to see the index on each machine, how can that be achieved?

  2. I have read some of hadoop/contrib/index and some of Katta, but I don't understand the idea of the "shards, which look like parts of the index". Is a shard stored on the local disk of one machine, or is it a single directory distributed across the cluster?

Thanks in advance.

Martín Schonaker
mandy

1 Answer


As for your question 1:

You can implement Lucene's "Directory" interface so that it works with Hadoop, and let Hadoop handle the files you submit to it. You could also provide your own implementations of "IndexWriter" and "IndexReader" and use your Hadoop client to write and read the index. That way you have more control over the format of the index you write, and you can "see" or access the index on each machine through your Lucene/Hadoop implementation.
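
A full custom Directory has to implement quite a few low-level methods, so as a rough illustration of the simpler route that projects like hadoop/contrib/index and Katta take, here is a minimal Java sketch that builds a shard on local disk with a plain Lucene FSDirectory and then copies the finished files into HDFS with Hadoop's FileSystem API. The paths, the namenode address and the field name are made-up placeholders, and the Lucene calls assume a 5.x-or-later style API:

    import java.net.URI;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class BuildAndShipIndex {
        public static void main(String[] args) throws Exception {
            // 1. Build the shard on local disk with a plain FSDirectory (hypothetical path).
            String localIndex = "/tmp/index-shard-0";
            try (FSDirectory dir = FSDirectory.open(Paths.get(localIndex));
                 IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("body", "hello distributed lucene", Field.Store.YES));
                writer.addDocument(doc);
            }

            // 2. Copy the finished index files into HDFS so other nodes can fetch them
            //    (namenode address is a placeholder).
            Configuration conf = new Configuration();
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            hdfs.copyFromLocalFile(new Path(localIndex), new Path("/indexes/shard-0"));
            hdfs.close();
        }
    }

Each node can later pull its shard back out of HDFS (or read it through a custom Directory implementation, as described above) before opening it with an IndexReader.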

For your question 2:

A shard is a subset of the index. When you run a query, all shards are processed at the same time and the results of the index search on all shards are combined. On each machine of your cluster you will have a part of your index: a shard. So a part of the index is stored on a local machine, but it appears to you as a single index distributed across the cluster.
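
To make the "combine the shards" part concrete, here is a small sketch in Java using Lucene's MultiReader, which presents several shard directories as one logical index. In a real cluster each shard sits on a different machine and a coordinator merges the per-shard results over the network instead; the shard paths, field and term below are hypothetical:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class ShardSearch {
        public static void main(String[] args) throws Exception {
            // Hypothetical local shard paths; in a real cluster each would live on a different node.
            String[] shardPaths = { "/data/shards/shard-0", "/data/shards/shard-1" };

            IndexReader[] readers = new IndexReader[shardPaths.length];
            for (int i = 0; i < shardPaths.length; i++) {
                readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(shardPaths[i])));
            }

            // MultiReader makes the shards look like one index; a distributed setup
            // performs the same merge of per-shard results, just over the network.
            try (MultiReader multi = new MultiReader(readers)) {
                IndexSearcher searcher = new IndexSearcher(multi);
                TopDocs hits = searcher.search(new TermQuery(new Term("body", "hadoop")), 10);
                System.out.println("docs returned across shards: " + hits.scoreDocs.length);
            }
        }
    }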

I can also suggest that you check out SolrCloud for distributed search. It runs on Lucene as its indexing/search engine and already gives you a clustered index. It also provides an API for submitting documents to index and for querying the index. Maybe it is sufficient for your use case.
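
If you go the SolrCloud route, the SolrJ client hides the sharding completely. Below is a small sketch, assuming a fairly recent SolrJ (roughly 7.x or later) and a made-up ZooKeeper address and collection name, that indexes one document and runs a query; SolrCloud routes the document to a shard and fans the query out to all shards for you:

    import java.util.Arrays;
    import java.util.Optional;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrCloudExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical ZooKeeper host and collection name.
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Arrays.asList("zk1:2181"), Optional.empty()).build()) {
                client.setDefaultCollection("myindex");

                // Index a document; SolrCloud decides which shard it lands on.
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");
                doc.addField("body", "distributed search with lucene and hadoop");
                client.add(doc);
                client.commit();

                // Query; the request is fanned out to all shards and the results are merged.
                QueryResponse rsp = client.query(new SolrQuery("body:lucene"));
                System.out.println("hits: " + rsp.getResults().getNumFound());
            }
        }
    }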

Esquive
  • Thank you for this inspired answer. Could you share more about how to design a custom Lucene writer and use MapReduce to generate the index on HDFS? – hakunami Apr 20 '14 at 10:52
  • We could. I stopped working on this project shortly after that, so I'm not really into it at the moment, but I'm happy I got back to it just a few days ago. I can already tell you that I decided not to use Hadoop and HDFS for that purpose. The reason is that MapReduce is designed to produce results from raw data: roughly, you don't index your documents before storing them, you handle everything in your map/reduce methods when searching your cluster. I decided to replace Hadoop with Cassandra, a P2P database where the data is already stored aggregated in a way that gives your queries good performance. – Esquive Apr 22 '14 at 16:11
  • Cassandra is a row key/value store, well suited for a search index. If you want to know more about this, you can have a look at the "lucandra" project. The author decided to use the IndexWriter/Reader approach, so you can see how he implements the custom writer and reader. We can also stay in touch, bearing in mind that I only monitor this project a couple of hours per week. – Esquive Apr 22 '14 at 16:16
  • I had never heard of these projects before; I will check them out. Thank you for the information, you are really kind! – hakunami Apr 22 '14 at 16:18