
Short Version

I would like to efficiently perform a full text search within an arbitrary set of objects in my database. All objects will be indexed in a search engine.

My idea

I'm planning on making this a two-part operation. First, the search engine would be queried for a weighted/sorted set of ids matching the full text search. This set of ids would then be filtered, removing any ids not in the user's original set.
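For concreteness, here is a minimal sketch of that two-step flow, assuming GAE's Search API and a hypothetical index named `vertices` whose document ids match my vertex ids (the names and options are illustrative, not settled):

```python
# Sketch only: assumes the App Engine Search API and a hypothetical
# index named 'vertices' whose document ids are my vertex ids.
from google.appengine.api import search

def search_within(query_string, allowed_ids, limit=1000):
    """Step 1: full text search; step 2: filter to the user's set."""
    index = search.Index(name='vertices')
    options = search.QueryOptions(limit=limit, ids_only=True)
    results = index.search(search.Query(query_string, options=options))
    allowed = set(allowed_ids)
    # Keep the engine's relevance ordering while dropping ids outside the set.
    return [doc.doc_id for doc in results if doc.doc_id in allowed]
```

Since membership tests against a set are O(1), the cost should be dominated by the search call itself rather than the filtering step.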

Is there a better way to do this? If not, can you provide any advice on doing this efficiently?

Long Version

I am in the planning phase of building a web application that will allow users to visualize sets of highly linked data and manipulate these visualizations to derive sets of interesting vertices for further analysis. The filtering actions performed by the user through the GUI will be complex and very difficult to express as indexable quantities.

I would like to allow users to perform full text searches within these data sets. Google implements "search within results" by simply appending the earlier search query to the new query, but that approach may not be feasible for my data.
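As a hypothetical illustration (not code from my app), the append approach amounts to this, and it only works when the original result set is itself expressible as a query string:

```python
# Hypothetical illustration of the "search within results" trick: the old
# query is simply ANDed onto the new one. This breaks down when the original
# set came from GUI manipulations that have no query-string equivalent.
def search_within_results(previous_query, new_query):
    return '(%s) AND (%s)' % (previous_query, new_query)
```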

The accepted answer to this question promotes the idea of using database operations to filter results coming from a search engine.

As part of the solution, I am also considering having the front end switch over to using lunr when the set of vertices the user wants to search within gets small enough for the front end to handle. Figuring out what this limit is will take some testing, but I doubt it will be more than a few thousand, so the need for a server-side solution remains.

Environment Details

I'm running Python 2.7 on App Engine.

In this application, I expect the initial result sets (that will be searched within) to contain between 10 and 2000 vertices. The total number of vertices in the entire database could be a couple orders of magnitude larger.

turtlemonvh

2 Answers


If you try to use GAE Datastore for analytics, you are going to have a bad time.

Datastore queries are very limited: they do not support inequality filters on multiple properties, and there is no built-in full text search.

You might want to look at Google BigQuery, which has rich queries and supports regex filtering. It also supports "intermediary tables", where you can use the result of one query as the input data for another. I don't fully grasp your problem, but it seems this is what you need.
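For illustration, a rough sketch of what that could look like, assuming the google-cloud-bigquery Python client and hypothetical project/dataset/table names; treat this as the shape of the idea rather than a drop-in:

```python
# Rough sketch, hypothetical names throughout: filter rows by regex and write
# the result to an intermediary table that a later query can read as input.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig()
job_config.destination = bigquery.TableReference.from_string(
    'my_project.my_dataset.filtered_vertices')  # hypothetical intermediary table

sql = """
    SELECT id, label
    FROM `my_project.my_dataset.vertices`
    WHERE REGEXP_CONTAINS(label, r'graph.*analysis')
"""
client.query(sql, job_config=job_config).result()  # blocks until the job is done
```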

Peter Knego
  • Ha ha at the South Park reference. But this isn't for analytics, I promise. Also it looks like BigQuery is read-only, so that's not going to work for me. The data will not change very often, but it will need to change. The idea of caching intermediate values for faster resolution of key queries is probably something I can use, though. – turtlemonvh Jul 15 '13 at 20:53
  • Now that I think of it, although this is not directly analytics, it does have that sort of a flavor to it. @stevep's recommended answer reminds me of [how gauges uses mongodb for analytics](http://www.10gen.com/presentations/mongodb-analytics). – turtlemonvh Jul 15 '13 at 21:48

TLDNR: Can you do some pre-processing to set up a large hash dictionary that the client could use to support the different queries?

How dynamic and big is your data? I am working on something that may be similar, if your data is relatively static. We have a web page that lets users create AND and OR selections from any combination of about 300 variables. Each variable can have many hundreds of items associated with it. Because the datasets for the variables are relatively static and not ginormous, we store them as json.dumps output in a TextProperty field. When parsed by the browser, the JSON simply becomes a big dictionary keyed by the variable ids. Each key's value is an array of items (image ids in our case) associated with the selected key. All the intersections and combinations are done with a few small JavaScript functions that are fed with these arrays.

This has worked very well: users compliment the speed, and this approach greatly, greatly simplifies the GAE side. All the JSON variables are loaded/updated in a somewhat lazy, near-realtime fashion via crons and taskqueues. For final display, the results are formatted and inserted into a div's innerHTML. Once all the images are cached, the browser response for formatting and displaying hundreds of 420x280 pixel images is nearly instantaneous. Very cool to see, and a tribute to the folks working on the browsers, both layout and JS optimizations. (I should note that we use pure JS to ensure minimal overhead vs. something like jQuery.) HTH -stevep
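A minimal sketch of the server-side precomputation stevep describes, assuming GAE's ndb and a hypothetical `VariableIndex` model: one entity per variable, holding the JSON-encoded item list the browser will intersect:

```python
# Sketch only: model and function names here are hypothetical.
import json
from google.appengine.ext import ndb

class VariableIndex(ndb.Model):
    items_json = ndb.TextProperty()  # e.g. '["img1", "img42", ...]'

def rebuild_variable(variable_id, item_ids):
    """Run from a cron/taskqueue job to refresh one variable's item list."""
    VariableIndex(id=variable_id, items_json=json.dumps(list(item_ids))).put()

def variables_payload(variable_ids):
    """Build the {variable_id: [item ids]} dict the browser parses."""
    keys = [ndb.Key(VariableIndex, vid) for vid in variable_ids]
    entities = ndb.get_multi(keys)
    return dict((k.id(), json.loads(e.items_json))
                for k, e in zip(keys, entities) if e is not None)
```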

stevep
  • Steve - the data is similar to the data that a bookmark program would get. Lots of linked items that are stored by individual users in hierarchies and also tagged. So there will be a lot of tags, a lot of folders, and a lot of items. – turtlemonvh Jul 15 '13 at 20:58
  • Your approach does sound interesting, and I'd like to try to understand it a bit more. What I'm hearing: You have ~300 possible attributes for several hundred photos, and the list of every matching photo is stored as a string-ified json array object linked to the attribute object. – turtlemonvh Jul 15 '13 at 21:03
  • That may actually work for me. If I restrict my pre-computed tag-item associations to a per-user scope the data may be small enough for this to work. I'm going to test that out... – turtlemonvh Jul 15 '13 at 21:05
  • I hope you are successful turtlemonvh. In case you have interest, this SO link is very good re: performant array intersection using JS. I am using intersect_safe. Link: http://stackoverflow.com/questions/1885557/simplest-code-for-array-intersection-in-javascript – stevep Jul 16 '13 at 22:02
  • Thanks, that is a great resource - I bookmarked that exact page a few days ago for this very reason. – turtlemonvh Jul 18 '13 at 12:38