I was going through my mail and saw that Gmail automatically suggested adding "coming friday around 5pm" as an event on 21st Feb. I am surprised that Gmail can do this. How did it correctly figure out that "this friday" meant the coming Friday, and also that the 5 PM is linked with Friday?

I am a newbie in NLP and machine learning, so I would be very glad if someone could explain it to me in layman's terms.

Anubhav Agarwal
  • hi! this question might also shed some light on your question: http://stackoverflow.com/a/9344555/583834 – arturomp Feb 18 '14 at 17:41

1 Answer

I don't think this needs a lot of machine learning as such. A bit of NLP is helpful to get the dependencies from the sentence but even that isn't strictly necessary.

You could start off by just looking for keywords such as `monday`, `tuesday` etc. and then looking at what is around them: `last monday`, `next monday`, `coming monday`, `previous monday` and so on. These are called window features because they provide a window of +/- 1, 2, 3 ... tokens around the feature you are interested in (`monday`). The `around 5pm` you could theoretically also get from just looking at window features, though I don't have an intuition as to how noisy that would be. Try to think of all the ways of expressing time in that context and then think of how those ways can be mixed up with something else. Off the top of my head it would seem relatively easy to do that.
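As a rough illustration of the window-feature idea, here is a minimal Python sketch. The tokenizer, the window size of 2 and the helper names (`tokenize`, `window_features`, `nearby_time`) are my own assumptions for the example, not anything Gmail is known to use.

```python
import re

WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}

# Matches clock times such as "5pm" or "5:30pm" (after whitespace is removed).
TIME_RE = re.compile(r"^\d{1,2}(:\d{2})?(am|pm)$")


def tokenize(text):
    """Lowercase the text and split it into tokens, keeping '5 PM' as '5pm'."""
    raw = re.findall(r"\d{1,2}(?::\d{2})?\s?(?:am|pm)\b|[a-z']+|\d+", text.lower())
    # Drop any whitespace inside a time token so "5 pm" becomes "5pm".
    return ["".join(tok.split()) for tok in raw]


def window_features(tokens, size=2):
    """For every weekday keyword, collect `size` tokens on each side of it."""
    features = []
    for i, tok in enumerate(tokens):
        if tok in WEEKDAYS:
            features.append({
                "day": tok,
                "left": tokens[max(0, i - size):i],
                "right": tokens[i + 1:i + 1 + size],
            })
    return features


def nearby_time(feature):
    """Return the first clock time found inside the window, or None."""
    for tok in feature["left"] + feature["right"]:
        if TIME_RE.match(tok):
            return tok
    return None


tokens = tokenize("Let's meet this coming Friday around 5pm")
for feat in window_features(tokens):
    print(feat["day"], feat["left"], feat["right"], nearby_time(feat))
```

For a sentence like "let's meet this friday or 6 PM tonight" the same window happily attaches `6pm` to `friday`, which is exactly the kind of noise you would have to deal with in this approach.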

Anyhow, the other way is to use a dependency parser to extract the grammatical relations between the elements of the sentence. This requires you to part-of-speech (POS) tag the sentence (after splitting it into tokens). The POS tagger would need to be trained to recognize that `friday` and `monday` are nouns, perhaps even that they are temporal expressions; the same goes for `5pm` and `around 5pm`. That does require machine learning, and a lot of it. The advantage Google has over others is that they have a lot of data, which gives them lots and lots of examples of different ways of expressing what is essentially the same thing. This gives their models a lot of breadth. Once you've got the sentence POS tagged, you feed it to a dependency parser (such as the Stanford Dependency Parser), which tells you what the relations between all the different tokens in the sentence are.

Again, Google has a lot of data, which helps. On top of all this, Google has had years to hone the output of the models so that when the models aren't entirely sure what is going on, the system won't highlight/extract the result. In terms of actually applying NLP in the real world this last step is very important because it gives people confidence in what the system is doing. Basically, if the software isn't sure what is happening, it should do nothing, because doing something risks doing the wrong thing, which then reduces people's confidence in the system as a whole.
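The "do nothing when unsure" rule can be sketched in a few lines of Python. Everything here is hypothetical: the `Candidate` type, the confidence scores and the 0.9 threshold are assumptions made up for the example, and I have no idea what Google actually uses.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    text: str          # the extracted time expression
    confidence: float  # hypothetical model score in [0, 1]


def suggest_event(candidates: Sequence[Candidate],
                  threshold: float = 0.9) -> Optional[Candidate]:
    """Return the best candidate only if its score clears the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.confidence)
    # Below the threshold we return None, i.e. show the user nothing at all.
    return best if best.confidence >= threshold else None


# A clear-cut case: one confident reading, so it is suggested.
print(suggest_event([Candidate("Friday 5pm", 0.97)]))

# An ambiguous case: two competing readings, neither confident enough.
print(suggest_event([Candidate("Friday 6pm", 0.55),
                     Candidate("tonight 6pm", 0.45)]))
```

A high threshold trades coverage for precision: the system misses some events, but the ones it does highlight are rarely wrong.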

Releasing a reliable, easy-to-use NLP application requires a tradeoff between the quality of the NLP/machine learning and the general software engineering needed to hide from the users all the parts where the NLP fails.

Try sending yourself emails with the time expressed in different ways and see which ones Google gets and which ones it doesn't. For instance:

  • Can we meet Friday next week?
  • How about coffee next week's Friday at 2pm
  • I can't do Friday but I can meet Wednesday at 4pm

And so on. It's always interesting to poke holes in technology. It can also reveal quite a lot about what it is doing, and how it is doing it.

Matti Lyra
  • Yeah, I thought about window features. But I guess the biggest issue is how you stop the noise. For statements like "let's meet this friday or 6 PM tonight" or "let's meet this friday and 6 PM on thursday", how would I work out which day to associate 6 PM with? I did try sending many of these to myself. Unfortunately for me, Gmail doesn't even parse a simple event like "Friday 6PM". – Anubhav Agarwal Feb 18 '14 at 12:36
  • @AnubhavAgarwal `let's meet this friday or 6 PM tonight` would certainly need dependency parsing to get right because relating the `or` is basically impossible otherwise. The parse tree would mark the or to be separating the two noun phrases around it, allowing you to tell the difference. – Matti Lyra Feb 18 '14 at 12:39
  • @AnubhavAgarwal do you mind accepting the answer then? – Matti Lyra Feb 18 '14 at 21:33
  • And Thunderbird will remind you not to forget your attachment if 'attach' is a word in your mail and you haven't attached a file. – Kun Wu Feb 19 '14 at 03:55