2

This question is about whether a particular technology exists. The scenario is as follows.

We are going to provide 200 English words. The software can add an additional 40 words, which is 20% of 200. Using these, the software should write meaningful dialogues with no grammar mistakes.

For this, I looked into Spintax and article spinning, but those just take existing articles and rewrite them, which does not seem like the best approach here (is it? please let me know if it is). So, is there any technology capable of doing this? Maybe the semantic theory that Google uses? Any proven AI method?

Please help.

Dongle
    I smell something fishy... – Danny Beckett Dec 06 '13 at 08:10
  • @DannyBeckett: Why is that? – Dongle Dec 06 '13 at 08:13
  • Because it's how a lot of spam is written. – Danny Beckett Dec 06 '13 at 08:14
  • @DannyBeckett: Great, so such a technology exists. This is not spamming; it is about helping poor students with no English knowledge to learn it without spending money on teachers – Dongle Dec 06 '13 at 08:48
  • Ah fair enough! Good luck!! – Danny Beckett Dec 06 '13 at 08:49
    Seems like giving the poor students some children's books would be a much better idea. – Don Reba Dec 06 '13 at 08:55
    @DonReba: No, different people do different things. Let's stop this discussion here since it is off-topic – Dongle Dec 06 '13 at 08:57
  • For English it can be done to some extent, I think. If you have a medium-sized text input, you could try a Markov chain-based model. A simple implementation is one where you create groups of words and check what usually comes after a given group (a minimal sketch appears just after this comment thread). Here's a simple IRC chatbot I wrote a few years ago; take a look, it has some description and references for the details: https://github.com/rlegendi/edu.lro.shapeshifter – rlegendi Dec 06 '13 at 09:27
  • @rlegendi: Thanks, interesting. I am reading it now. – Dongle Dec 06 '13 at 11:49
  • There are quite a few approaches, but they are pretty data-intensive. A lot of data is required to "train" the learning algorithm before it can generate articles. Using a small amount of data for training will produce nonsensical articles. Are you looking for research-like approaches, or readily available tools? – Chthonic Project Dec 06 '13 at 14:19
  • @ChthonicProject: Thanks for the reply. We are looking for both tools and research-like approaches. Please help – Dongle Dec 06 '13 at 14:33
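
A minimal word-level Markov chain along the lines described in the comment above (the corpus string and chain order below are placeholders; any training text can be substituted):

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each group of `order` consecutive words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=12):
    """Start from a random group and repeatedly pick a random observed follower."""
    key = random.choice(list(chain.keys()))
    out = list(key)
    while len(out) < length:
        followers = chain.get(tuple(out[-len(key):]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat saw the dog on the mat"
print(generate(build_chain(corpus)))
```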

1 Answer

9

To begin with, a word of caution: this is at the forefront of research in natural language generation (NLG), and the state-of-the-art research is not nearly good enough to replace a human teacher. The problem is especially complicated for students with English as a second language (ESL), because they tend to think in their native tongue before mentally translating the knowledge into English. That daunting prelude aside, the usual way to go about this is as follows.

NLG comprises three main components:

  1. Content Planning
  2. Sentence Planning
  3. Surface Realization

Content Planning: This stage breaks down the high-level goal of communication into structured atomic goals. These atomic goals are small enough to be reached with a single step of communication (e.g. in a single clause).
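
As a rough illustration of what the output of this stage might look like, here is a hypothetical decomposition; the goal and its atomic pieces are invented examples, not taken from any particular system:

```python
# Hypothetical content plan: the high-level goal and its atomic goals are
# invented examples, shown only to illustrate the kind of structure involved.
high_level_goal = "describe yesterday's visit to a friend"

atomic_goals = [
    {"act": "inform",  "content": ("visit", "speaker", "friend's house", "yesterday")},
    {"act": "inform",  "content": ("absent", "friend", "at that time")},
    {"act": "express", "content": ("disappointment", "speaker")},
]
# Each atomic goal is small enough to be realized later as a single clause.
```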

Sentence Planning: Here, the actual lexemes (i.e. words or word-parts that bear clear semantics) are chosen to be part of the atomic communicative goal. The lexemes are connected through predicate-argument structures. The sentence planning stage also decides upon sentence boundaries (e.g. should the student write "I went there, but she was already gone." or "I went there to see her. She had already left."? Notice the different sentence boundaries and different lexemes, but both versions convey the same meaning).
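
For instance, a sentence plan for the example above might be represented roughly like this (the field names and structure are purely illustrative):

```python
# Illustrative sentence plan for the example above; field names are invented.
sentence_plan = [
    {   # clause 1: "I went there"
        "predicate": "go",
        "args": {"agent": "I", "destination": "there"},
        "tense": "past",
    },
    {   # clause 2: "she was already gone" / "she had already left"
        "predicate": "leave",            # lexeme choice: "gone" vs. "left"
        "args": {"agent": "she"},
        "tense": "past_perfect",
        "adverb": "already",
    },
]
# The planner also decides whether these two clauses share one sentence
# (joined by "but") or are split across a sentence boundary.
```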

Surface Realization: The semi-formed structure attained in the sentence planning step is morphed into a proper form by incorporating function words (determiners, auxiliaries, etc.) and inflections.
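
Continuing the illustrative structures above, a toy realizer might look like the sketch below; real systems use grammars and morphological resources rather than hard-coded lookup tables:

```python
# Toy surface realizer for the clause structure sketched above.
def realize(clause):
    # Inflection: pick the verb form matching the lexeme and tense.
    verb_forms = {("go", "past"): "went", ("leave", "past_perfect"): "had left"}
    verb = verb_forms[(clause["predicate"], clause["tense"])]
    # Function words (determiners, prepositions, auxiliaries) would normally be
    # inserted here; this toy version only strings the arguments together.
    parts = [clause["args"]["agent"], verb, clause["args"].get("destination", "")]
    return " ".join(p for p in parts if p).capitalize() + "."

clause = {"predicate": "go", "tense": "past",
          "args": {"agent": "I", "destination": "there"}}
print(realize(clause))  # -> "I went there."
```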

In your particular scenario, most of the words are already provided, so choosing the lexemes is going to be relatively simple. The predicate-argument structures connecting the lexemes need to be learned by using a suitable probabilistic learning model (e.g. hidden Markov models). The surface realization, which ensures the final correct grammatical structure, should be a combination of grammar rules and statistical language models.
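
As a minimal sketch of the statistical-language-model part, here is a bigram model with add-one smoothing; the training text is a placeholder, and in your scenario generation would additionally be restricted to the roughly 240 allowed words:

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for w1, w2 in zip(words, words[1:]):
        counts[w1][w2] += 1
    return counts

def sentence_score(counts, sentence):
    """Product of P(w2 | w1) estimates with add-one smoothing."""
    words = sentence.lower().split()
    vocab = len(counts)
    score = 1.0
    for w1, w2 in zip(words, words[1:]):
        score *= (counts[w1][w2] + 1) / (sum(counts[w1].values()) + vocab)
    return score

counts = train_bigrams("the student reads the book . the student writes a letter .")
print(sentence_score(counts, "the student reads a letter"))
print(sentence_score(counts, "letter a reads student the"))  # scores lower
```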

At a high level, note that content planning is language-agnostic (but it is, quite possibly, culture-dependent), while the last two stages are language-dependent.

As a final note, I would like to add that the choice of the 40 extra words is something I have glossed over, but it is no less important than the other parts of this process. In my opinion, these extra words should be chosen based on their syntagmatic relation to the 200 given words.
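
One simple, admittedly crude way to approximate syntagmatic relatedness is to count how often each candidate word co-occurs with the 200 given words within a fixed window over a reference corpus; the corpus, window size, and word lists below are placeholders:

```python
from collections import Counter

def cooccurrence_scores(corpus_words, given, candidates, window=4):
    """Score each candidate by how often it appears near any of the given words."""
    given, candidates = set(given), set(candidates)
    scores = Counter()
    for i, w in enumerate(corpus_words):
        if w in candidates:
            context = corpus_words[max(0, i - window):i] + corpus_words[i + 1:i + 1 + window]
            scores[w] += sum(1 for c in context if c in given)
    return scores

corpus = "the student reads a book in the library every evening".split()
given = ["student", "book", "library"]          # stand-ins for the 200 given words
candidates = ["reads", "evening", "green"]      # stand-ins for extra-word candidates
print(cooccurrence_scores(corpus, given, candidates).most_common())
```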

For further details, the two following papers provide a good start (complete with process flow architectures, examples, etc.):

  1. Natural Language Generation in Dialog Systems
  2. Stochastic Language Generation for Spoken Dialogue Systems

To better understand the notion of syntagmatic relations, I found Sahlgren's article on the distributional hypothesis extremely helpful. The distributional approach in his work can also be used to learn the predicate-argument structures I mentioned earlier.

Finally, to add a few available tools: take a look at this ACL list of NLG systems. I haven't used any of them, but I've heard good things about SPUD and OpenCCG.

Chthonic Project
  • Hi, Thank you a lot. This is great info! I will get back to you in case I need further help – Dongle Dec 07 '13 at 18:09