I have a huge amount of data that needs to be indexed, and it took more than 10 hours to get the job done. Is there a way I can do this on Hadoop? Has anyone done this before? Thanks a lot!
3 Answers
You haven't explained where the 10 hours go. Is the time spent extracting the data, or just indexing it?
If the extraction is what takes long, then Hadoop can help. Solr supports bulk inserts, so in your map function you could accumulate thousands of records and commit them to Solr for indexing in one shot. That will improve your performance a lot.
Also, what size is your data?
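A minimal sketch of that bulk-insert pattern, assuming a SolrJ 4.x-style `HttpSolrServer`; the Solr URL, the `id`/`text` field names, and the batch size of 1000 are placeholders for illustration:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {

    private static final int BATCH_SIZE = 1000;  // add granularity, tune as needed

    /** Sends records to Solr in batches instead of one HTTP call per document. */
    public static void index(Iterable<String[]> records) throws Exception {
        // Placeholder URL -- point this at your own Solr core.
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);

        for (String[] record : records) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", record[0]);     // placeholder schema fields
            doc.addField("text", record[1]);
            batch.add(doc);

            if (batch.size() >= BATCH_SIZE) {
                solr.add(batch);               // one bulk add for the whole batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);                   // flush the last partial batch
        }
        solr.commit();                         // single commit at the end
    }
}
```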
You could also collect a large number of records in the reduce function of a map/reduce job. You have to generate proper keys in your map so that a large number of records go to a single reduce function. In your custom reduce class, initialize the Solr object in the setup/configure method (depending on your Hadoop version) and close it in the cleanup method. Build a document collection object (in SolrNet or SolrJ) and commit all of the documents in one single shot.
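A reducer skeleton along those lines, assuming the newer `org.apache.hadoop.mapreduce` API and SolrJ; the Solr URL, the `id`/`text` fields, the tab-separated record format, and the batch size are all assumptions for illustration:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

    private static final int BATCH_SIZE = 1000;   // tune to your heap and Solr setup

    private SolrServer solr;
    private List<SolrInputDocument> buffer;

    @Override
    protected void setup(Context context) {
        // Read the Solr URL from the job configuration; the default is a placeholder.
        solr = new HttpSolrServer(context.getConfiguration()
                .get("solr.url", "http://localhost:8983/solr/mycore"));
        buffer = new ArrayList<SolrInputDocument>(BATCH_SIZE);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Assumed record layout: "id<TAB>text".
            String[] parts = value.toString().split("\t", 2);
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", parts[0]);
            doc.addField("text", parts.length > 1 ? parts[1] : "");
            buffer.add(doc);
            if (buffer.size() >= BATCH_SIZE) {
                flush();
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        flush();
        try {
            solr.commit();     // a single commit per reducer
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    private void flush() throws IOException {
        if (buffer.isEmpty()) return;
        try {
            solr.add(buffer);  // one bulk add instead of one HTTP call per document
        } catch (Exception e) {
            throw new IOException(e);
        }
        buffer.clear();
    }
}
```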
If you are using Hadoop, there is another option called Katta. You could look at it as well.

- Thanks a lot, Animesh! The time was mainly spent indexing the data, since I had already processed the data before running a Java program that calls Solr over HTTP. And this program was running on the same machine as the Solr server. Maybe I should look into bulk insert?... – trillions Jul 25 '12 at 00:46
- Yeah, I have done that before, and bulk insert will really save a lot of time. – Animesh Raj Jha Jul 25 '12 at 02:03
- Thanks a lot, Animesh! The data I have is more than 20 million records. Just to confirm, for bulk insert you meant to keep adding docs and, once I hit around 1000 records, do a commit, right? – trillions Jul 25 '12 at 09:11
- Another question: if I use Hadoop, should I just use it for the HTTP calls to Solr, then do the bulk commit, and then I am all done? Sorry for asking for the details, as I am new to all of this. And thanks a lot for your help!! :) – trillions Jul 25 '12 at 09:17
- Thanks a lot for the detailed answer. My coworker mentioned Katta as well; I will look into it :) Thanks!! – trillions Jul 25 '12 at 23:30
You can write a MapReduce job over your Hadoop cluster which simply takes each record and sends it to Solr over HTTP for indexing. AFAIK Solr currently doesn't support indexing over a cluster of machines, so it would be worth looking into Elasticsearch if you want to distribute your index over multiple nodes as well.
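A rough sketch of such a map-only job, assuming a Hadoop 2.x-style driver and SolrJ; the Solr URL and the `id`/`text` fields are placeholders, and in practice you would batch the adds rather than sending one document per call:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrHttpIndexJob {

    /** Map-only task: each input split is pushed straight to Solr over HTTP. */
    public static class SolrHttpMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

        private SolrServer solr;

        @Override
        protected void setup(Context context) {
            solr = new HttpSolrServer(context.getConfiguration().get("solr.url"));
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", String.valueOf(offset.get()));  // placeholder fields
            doc.addField("text", line.toString());
            try {
                solr.add(doc);   // per-record call; batch these for better throughput
            } catch (Exception e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            try {
                solr.commit();   // one commit per map task
            } catch (Exception e) {
                throw new IOException(e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("solr.url", "http://solr-host:8983/solr/mycore");  // placeholder URL

        Job job = Job.getInstance(conf, "solr-http-indexing");
        job.setJarByClass(SolrHttpIndexJob.class);
        job.setMapperClass(SolrHttpMapper.class);
        job.setNumReduceTasks(0);                          // map-only job
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(NullWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(NullOutputFormat.class);  // nothing written back to HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```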

There is a Solr Hadoop output format which creates a new index in each reducer, so you distribute your keys according to the indices you want and then copy the HDFS files into your Solr instance after the fact.
http://www.datasalt.com/2011/10/front-end-view-generation-with-hadoop/
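The "copy the HDFS files into your Solr instance" step could look roughly like this, using Hadoop's `FileSystem` API; both paths are hypothetical, and Solr would still need a core reload or restart afterwards to pick up the new index:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyIndexToSolr {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: the index built by one reducer, and the local
        // data directory of the Solr core that should serve it.
        Path hdfsIndex = new Path("/output/solr-indexes/part-00000");
        Path localCoreData = new Path("/var/solr/mycore/data");

        // Pull the reducer-built Lucene index down to the Solr machine,
        // then reload the core (or restart Solr) so it sees the new segments.
        fs.copyToLocalFile(hdfsIndex, localCoreData);
    }
}
```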
