
I have an MVC application which I need to be able to search. The application is modular so it needs to be easy for modules to register data to index with the search module.

At present, there's just a quick interim solution in place, which is fine for flexibility, but speed was always going to be a problem. Modules register the models (along with relationships and columns) that they'd like to be searchable. On search, the search functionality queries data using those relationships, applies Levenshtein distance, removes stop words, does character replacements, etc. Clearly this will slow down as the volume of data increases, so it's not viable to keep as it is: it is effectively `select * from x,y,z` followed by mining through the data.

The benefit of the above is that there is a direct relation to the model which found the data. For example, if Model_Product finds something, I know that in my code I can use Model_Product::url() to link the result to the relevant location, or Model_Product::find(other data) to show, say, the image or description if the keyword was found in the title.

Another benefit of the above is it's already database specific, and therefore can just be thrown up onto a virtualhost and it works.

I have read about the various options, and they all seem very similar, so it's unlikely that people will be able to suggest the 'right' one without inciting discussion or debate. For the record, from the following options, Solr seems to be the one I'm leaning toward. I'm not set on it, so if anyone has any advice they'd like to share or other options I could look at, that'd be great.

Looking through various tutorials and guides they all seem relatively easy to set up and configure. In the case above I can have modules register the path of config files/search index models and have the searcher run them all through search program x. This will build my indexes, and provide the means by which to query data. Fine.

What I don't understand is how any of these indexes relate to my other code. If I index data, search and in turn find a result with, say, Solr, how do I know how to get all of the other information related to the bit it found?

Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost? This is something which I can't seem to find much information on. I would assume that I can just connect to a single instance and tell it what data is relevant? Much like connecting to a single DBMS server, with credentials x to database y.

Granted, I haven't read as extensively on this as I typically would. I'm a bit stuck in terms of direction at the moment, and I'd rather seek some advice from those who know before I take a particular route than read everything about everything.

Edit: This question seems to have swayed me more towards Solr. There's also a similar thread here with a fair amount of insight into Sphinx.

Ben Swinburne

1 Answer

DISCLAIMER: I can only speak about Lucene/Solr and, I believe, ElasticSearch as I know it is based on Lucene. Others might or might not work in the same way.

> If I index data, search and in turn find a result with say Solr, how do I know how to get all of the other information related to the bit it found?

You can store any extra data you want alongside the indexed text, e.g. a database key pointing to a particular row in the database. Lucene/Solr can also help you find related information: if you run a DVD rental shop and a user has misspelled a movie name, Lucene will figure this out for you and (unlike a plain DB query) still list the closest alternatives. You can also provide hints by boosting certain fields during indexing or querying. There are special extensions for geospatial search, etc., and you can provide your own if you need to.
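As a rough sketch of the read side of that idea: if every indexed document stores a model name and a row ID, search hits can be mapped back to application models. Everything named here (`Product`, `MODEL_LOADERS`, the `model` and `row_id` fields, the shape of `hits`) is a hypothetical stand-in for illustration, not part of Solr or of the asker's framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Product:
    """Hypothetical stand-in for an application model such as Model_Product."""
    id: int
    title: str

    def url(self) -> str:
        return f"/products/{self.id}"


# In-memory stand-in for the products table.
PRODUCTS = {7: Product(7, "Blue Widget"), 9: Product(9, "Red Widget")}

# Registry: model name stored in the index -> function that loads a row by ID.
MODEL_LOADERS: Dict[str, Callable[[int], object]] = {
    "Model_Product": lambda row_id: PRODUCTS[row_id],
}


def resolve_hits(hits: List[dict]) -> List[object]:
    """Turn the stored fields of search hits back into model instances."""
    return [MODEL_LOADERS[hit["model"]](hit["row_id"]) for hit in hits]


# Assumed shape of the stored fields a search engine hands back per hit.
hits = [{"model": "Model_Product", "row_id": 9}]
urls = [model.url() for model in resolve_hits(hits)]
print(urls)
```

Each module would add its own entry to the registry when it registers data for indexing, so the search code never needs to know about individual models.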

> Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost?

Lucene is a low-level library and has to be present in every JVM that uses it. Solr (built on top of Lucene) is an HTTP server, so you can call it from as many clients as you want. More scaling options are explained here.
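To illustrate the "single server, many clients" point: a Solr instance can host several cores, each with its own index, and each is addressed by its own URL path. The sketch below only builds the query URLs; the host, port, and core names (`site_a`, `site_b`) are assumptions chosen for the example.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed host and port


def select_url(core: str, query: str, rows: int = 10) -> str:
    """Build a query URL against one core of a shared Solr instance."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{core}/select?{params}"


# Two virtual hosts, one Solr server, one core per site (hypothetical names).
url_a = select_url("site_a", "title:widget")
url_b = select_url("site_b", "title:widget")
print(url_a)
print(url_b)
```

This is close to the "one DBMS server, database y" analogy from the question, with the core playing the role of the database.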

mindas
  • To add: ElasticSearch, like Solr, is an HTTP server as well. – Geert-Jan May 20 '13 at 14:38
  • How could I specify say a particular model name, or information not found in the dataset but about the dataset with the indexed information. For example if I index my `products` table, I would like the index to be able to refer back to Model_Product so that I know the row ID and can therefore use Model_Product to get information which is pertinent to my results but not necessarily found in that particular row, i.e. a related table. I've not been able to find information on how to do this yet. – Ben Swinburne May 22 '13 at 21:13
  • You need to denormalize your data structure into a flat document. In other words, you need to store either foreign key to another entity so it can be retrievable, or index relevant entity fields, or both. – mindas May 22 '13 at 21:29
  • I can store the foreign key etc. easily; that's fine because it's in the data set. What I actually need is to be able to store extra information which can't be found using a query, but automatically relates to that data. For example, I could do `select * from products` to build a products collection, but when retrieving that data I need to know which table the data came from. Is there a way I can say `select * from products + some static data not in the database`? – Ben Swinburne May 23 '13 at 14:09
  • This is definitely possible and depends on how you implement your integration with Solr. A request to index a document may contain as much additional information as necessary, so you can add your `product_source` field. Solr doesn't really care where this data came from - let it be database, or your code, or anything else. – mindas May 23 '13 at 14:48
  • I'm struggling to find where this is documented, I can only find mapping fields to data from a query, not data from an external source. Where is this documented so I know how to do it? I'm sure conceptually it is possible but I can't find how. Cheers – Ben Swinburne May 23 '13 at 15:48
  • Solr tutorial covers this: http://lucene.apache.org/solr/4_3_0/tutorial.html (see chapter on Indexing Data). You can also have a look at http://wiki.apache.org/solr/DataImportHandler – mindas May 23 '13 at 16:19
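The "query result plus static data" idea from the comments above could look roughly like the following when documents are built in application code rather than by a database import. This is a sketch under assumptions: `build_documents`, the `model` and `row_id` field names, and the composite `id` format are made up for illustration, and the Solr schema would need matching fields. `product_source` is the field name suggested in the comments.

```python
import json
from typing import Dict, Iterable, List


def build_documents(rows: Iterable[Dict], static_fields: Dict[str, str],
                    id_column: str = "id") -> List[Dict]:
    """Merge each database row with static, per-model fields."""
    docs = []
    for row in rows:
        doc = dict(row)
        doc.update(static_fields)       # data that is not in the database
        doc["row_id"] = row[id_column]  # original primary key, kept for lookups
        # Composite key so rows from different models cannot collide.
        doc["id"] = f'{static_fields["model"]}:{row[id_column]}'
        docs.append(doc)
    return docs


# Stand-in for the result of `select * from products`.
rows = [{"id": 7, "title": "Blue Widget"}, {"id": 9, "title": "Red Widget"}]
docs = build_documents(
    rows, {"model": "Model_Product", "product_source": "products"}
)

# JSON array of documents, the kind of body Solr's JSON update handler accepts.
payload = json.dumps(docs)
print(payload)
```

The resulting payload would then be sent in an HTTP POST with `Content-Type: application/json` to the core's update handler; the exact path differs between Solr versions, so check the tutorial linked above for the one you run.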