-2

I am working on a hotel reservation system. My task is to implement an algorithm that will give right suggestions when the name of a hotel is typed wrong. For example if the user types the name of the hotel as "MOFENBICK" instead of it's real name "MOVENPICK" then my algorithm should suggest "did you mean MOVENPICK". I plan to implement it using Machine Learning Ideas. What would a good choice of features be for this problem ?

CASE
  • 43
  • 8
  • Possible duplicate of [How does the Google “Did you mean?” Algorithm work?](https://stackoverflow.com/questions/307291/how-does-the-google-did-you-mean-algorithm-work) – m7913d Aug 19 '17 at 09:48
  • You are implementing a hotel reservation system in GNU Octave or MATLAB? I would have a look at leventhstein, for example https://octave.sourceforge.io/strings/function/editdistance.html keywords to search might be "fuzzy search" https://en.wikipedia.org/wiki/Approximate_string_matching – Andy Aug 19 '17 at 10:34
  • I plan to implement a prototype in Octave first. I wish to develop from scratch. What I intend to do is make a neural network or use linear regression to train with datasets so that it can predict the output from test or validation set. Since I am new to machine learning I have difficulty choosing the features for the neural network or for the linear regression model. – CASE Aug 19 '17 at 10:47
  • m7913d gave you the link how google does it. But you might not have the training data google has to using some sort of neural network here doesn't make sense for me. Have you read the link for approcimate string matching I gave you earlier? – Andy Aug 19 '17 at 19:40

1 Answers1

1

You don't need to go as far as implementing a neural network. That's overkill for this particular task.

As suggested, use Levenshtein-distance. The idea behind Levenshtein distance is that it defines a metric over strings. In simpler terms, it allows a computer-algorithm to say "mofenbick" and "movenpick" are at distance 2 (because 2 letters were changed).

Some pseudo-code to calculate Levennshtein:

function LevenshteinDistance(char s[1..m], char t[1..n]):

    // create two work vectors of integer distances
    declare int v0[n + 1]
    declare int v1[n + 1]

    // initialize v0 (the previous row of distances)
    // this row is A[0][i]: edit distance for an empty s
    // the distance is just the number of characters to delete from t
    for i from 0 to n:
        v0[i] = i

    for i from 0 to m-1:
        // calculate v1 (current row distances) from the previous row v0

        // first element of v1 is A[i+1][0]
        //   edit distance is delete (i+1) chars from s to match empty t
        v1[0] = i + 1

        // use formula to fill in the rest of the row
        for j from 0 to n-1:
            if s[i] = t[j]:
                substitutionCost := 0
            else:
                substitutionCost := 1
            v1[j + 1] := minimum(v1[j] + 1, v0[j + 1] + 1, v0[j] + substitutionCost)

        // copy v1 (current row) to v0 (previous row) for next iteration
        swap v0 with v1

    // after the last swap, the results of v1 are now in v0
    return v0[n]

Once you have a metric defined over Strings, you need a fast way to query a list of hotels of course. The naïve implementation would be 1. iterate over all hotel names in the database/set 2. calculate the Levenshtein distance between the given input and the hotel name 3. pick the name that yields the smallest edit distance

Although this works fine for small sets, you can optimize this even further, using a BK-Tree.

reading material:

Joris Schellekens
  • 8,483
  • 2
  • 23
  • 54
  • Thank you very much. I will try to implement this. But I also intend to make a solution using neural networks since I wish to develop my skills in Machine Learning too. So if I were to implement this using neural networks what would be the input features in each input that would probably give me good learning curve? – CASE Aug 19 '17 at 16:47
  • Besides that wouldn't it be computationally too expensive to iterate over all the hotel names which would be thousands in number every time the user makes and entry? If we could get ML parameters all we need is a multiplication with the parameter matrix and take the sigmoid and map highest index to the particular class. I am newbie to programming. So my argument may be wrong. – CASE Aug 19 '17 at 17:17
  • @Joris: I'm curious why you're adding pseudo code if there are so many real implementations for GNU Octave, for example what I've linked above https://sourceforge.net/p/octave/strings/ci/default/tree/inst/editdistance.m also the odepkg has levenshtein distance – Andy Aug 19 '17 at 19:46
  • @Fawaz.A.R: It is computationally expensive to iterate over such a list. That's why my post suggests you use a BK-tree. – Joris Schellekens Aug 19 '17 at 22:56
  • @Andy: I always try to provide a code-sample in my answers. It helps the user out a lot more than just linkfarming. It also saves the user some time, seeing as I already made a selection as to which implementation I find most useful. – Joris Schellekens Aug 19 '17 at 22:57
  • @Fawaz.A.R: Part of developping your skills is learning to use the right tool for the the job. In this case, neural networks simply aren't the right tool. However, there are a lot of related use-cases that could be implemented with machine learning. – Joris Schellekens Aug 19 '17 at 23:06
  • Can you suggest me a fast way to loop over all the hotel names and calculate their Levenshtein-distance? Please note that there are around 2 million hotel names that has to be checked. – CASE Aug 20 '17 at 11:22
  • Like I have said in the main answer. Use a BK-tree. – Joris Schellekens Aug 20 '17 at 15:31