An Illustration:
If you have some way of obtaining a complete list of interests (perhaps you are letting users choose a specific entry from a set of interests), you can use simple matrix multiplication with a corresponding search vector.
Edit: This approach also works inverted, i.e., so long as you transpose properly you can map users to groups instead of groups to users. You may want to do this since you will likely have far more users than groups, although the example is the same in principle.
Let groups = [
1: "exercise"
2: "traveling"
3: "praying"
4: "eating"
5: "running"
6: "shopping"
]
Let U = [
1 1 1 0 0 0 // user 1
0 0 1 1 1 0 // user 2
0 0 0 0 0 1 // user 3
]
Use OR here, since you want members in any of the requested groups:
Let V = [
1 // exercise
1 // traveling
1 // praying
0 // eating
0 // running
0 // shopping
]
Multiplying:
U · V = [
3 // user 1 is in all 3 groups => match
1 // user 2 is in one group => match
0 // user 3 is in no groups => no match
]
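The multiplication above can be sketched in NumPy (using the same hypothetical users and groups as the matrices shown):

```python
import numpy as np

# Rows are users, columns are the six groups listed above.
U = np.array([
    [1, 1, 1, 0, 0, 0],  # user 1: exercise, traveling, praying
    [0, 0, 1, 1, 1, 0],  # user 2: praying, eating, running
    [0, 0, 0, 0, 0, 1],  # user 3: shopping
])

# Search vector: looking for exercise, traveling, or praying.
V = np.array([1, 1, 1, 0, 0, 0])

scores = U @ V
print(scores)       # [3 1 0]
print(scores > 0)   # OR match: [ True  True False]
```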
This checks every user for the presence of one or more of the requested columns (OR); any nonzero entry in the resulting vector is a match.
Alternatively, with the exact same query, to select only users in a specific set of two or more groups (AND), consider a match to be any entry in the resulting vector with value n or greater, where n is the number of requested columns.
Selecting only those in one or more specific groups and in none of certain other groups (an XOR-style exclusion) works if you weight the excluded columns -1 in the search vector; matches are then only those resulting entries equal to exactly n, the number of requested columns.
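These thresholds can be expressed directly on the score vector; note that the -1 weighting for excluded groups is an assumption made explicit here so the exclusion arithmetic works out:

```python
import numpy as np

U = np.array([
    [1, 1, 1, 0, 0, 0],  # user 1
    [0, 0, 1, 1, 1, 0],  # user 2
    [0, 0, 0, 0, 0, 1],  # user 3
])

# AND: require both exercise AND traveling (n = 2 requested columns).
V_and = np.array([1, 1, 0, 0, 0, 0])
n = V_and.sum()
and_matches = (U @ V_and) >= n       # only user 1 has both

# Exclusion: want praying, but NOT eating; the excluded column is
# weighted -1 so any excluded membership pulls the score below n.
V_not = np.array([0, 0, 1, -1, 0, 0])
n_wanted = 1
not_matches = (U @ V_not) >= n_wanted  # user 1 matches, user 2 does not

print(and_matches)   # [ True False False]
print(not_matches)   # [ True False False]
```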
Is This Really a Good Idea?
This sort of approach could be used if you think the real issue is that queries may become complex enough that the query analyzer becomes the bottleneck, if queries would become extremely hard to manage, if you need to deal with an ever-changing list of 'groups', or if you simply intend to do a lot of linear algebra in your application anyway.
The solution depends foremost on your use case. For example, if query speed is paramount and data transfer is less of an issue, this approach allows a very simple query that returns all matching rows (with a LIMIT); you can then sift through the results until you find enough users for a given page, running subsequent queries only as needed to load more pages. Since you mentioned this occurs every time you receive a request from the mobile app, you may be better off caching a manageable pool of users and polling that instead of the database on each request, falling back to the database only when too few matches are found, and using a time-tested cache-replacement algorithm (which can also be offloaded to the client to some extent).
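A minimal sketch of that caching idea, assuming an LRU replacement policy; `MatchCache`, `find_matches`, and `query_db` are all hypothetical names, with `query_db` standing in for whatever query layer you actually have:

```python
from collections import OrderedDict

class MatchCache:
    """Hypothetical LRU cache of match results, keyed by search vector,
    so repeated app requests don't hit the database every time."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)         # mark as recently used
        return self._entries[key]

    def put(self, key, matches):
        self._entries[key] = matches
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

def find_matches(search_vector, cache, query_db):
    key = tuple(search_vector)
    cached = cache.get(key)
    if cached is not None:
        return cached                   # served from cache
    matches = query_db(search_vector)   # fall back to the database
    cache.put(key, matches)
    return matches
```

Real code would also need invalidation when group memberships change, which is where the maintainability trade-off discussed below comes in.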
Conclusion/tl;dr
The important take-away here is that the structure you want depends entirely on the business requirements of your application. You can make the data structures as esoteric as you like with the intent of improving performance, but this is often less fruitful than simply using a time-tested solution such as basic caching.
If you think a refactor to an inverted key approach like that suggested by Yavar will best suit your needs, that may be your solution.
If you think a graph database is necessary, will fulfill your business requirements, and will be faster and easier to manage, that may be your solution.
If your needs are so specific that you need an entirely specific custom-built implementation that is entirely optimized to your application and not necessarily useful elsewhere, that may be your solution.
Design standards exist for good reasons, but optimization can be very domain-specific. As you can see, there are several possible solutions; which is best for you depends on many unknown factors such as business requirements. Ultimately, the correct solution is one that is sufficiently fast without sacrificing maintainability/sanity/hair.