5

We've built a system that analyzes data and outputs the results in plain English (i.e. no charts etc.). The current implementation relies on lots of templates and some randomization to make the text as varied as possible.

We'd like to switch to something more advanced, in the hope that the generated text is less repetitive and sounds less robotic. I've searched a lot on Google but cannot find anything concrete to start from. Any ideas?

EDIT: The data fed to the NLG mechanism is in JSON format. Here is an example based on web analytics data. The JSON file may contain, for example, a metric (e.g. visits), its value in the last X days, whether the last value is expected or not, and which dimensions (e.g. countries or marketing channels) affected its change.

The current implementation could give something like this:

Overall visits in the UK mainly from ABC email campaign reached 10K (+20% DoD) and were above the expected value by 10%. Users were mainly landing on XXX page while the increase was consistent across devices.
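For reference, the template-plus-randomization approach described above can be sketched roughly as follows. This is only a minimal illustration: the templates, the field names (`metric`, `country`, `value`, `change`) and the `render` helper are hypothetical, not the actual implementation.

```python
import json
import random

# Hypothetical templates; the real system's templates are not shown in this post.
TEMPLATES = [
    "{metric} in {country} reached {value} ({change:+d}% DoD).",
    "{country} saw {metric} hit {value}, a {change:+d}% change day over day.",
]

def render(record, rng=random):
    """Fill a randomly chosen template with the fields of one JSON record."""
    sentence = rng.choice(TEMPLATES).format(**record)
    # Capitalize the first character so the sentence reads naturally.
    return sentence[0].upper() + sentence[1:]

# Hypothetical input record with assumed field names.
record = json.loads(
    '{"metric": "visits", "country": "the UK", "value": "10K", "change": 20}'
)
print(render(record))
```

Diversity here comes only from the number of templates and the random choice among them, which is the limitation we'd like to get past.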

We're looking for a way to depend less on templates, sound even more natural, and increase the vocabulary.

Stergios
  • Well what kind of data are you working with, what do these results look like? Are you going for more natural-sounding sentences or is it just about mixing things up vocab-wise? – patrick May 24 '17 at 13:41
  • @patrick Edited my post above – Stergios May 24 '17 at 14:01
  • This is a pretty broad question, so I'm not sure it's a good fit for Stack Overflow. That said, why, in particular, do you want to get away from templates? Wouldn't more templates help to sound more natural and increase vocabulary while retaining ease of maintainability and testability? – Adrian McCarthy May 31 '17 at 22:12

2 Answers

2

What you are looking for is a hot research area and a pretty tough task. Currently there is no way to generate sentences that are always meaningful, diverse, and natural. One approach is to use n-grams. With this method you can generate sentences that look more natural and diverse, but they will probably be meaningless and grammatically incorrect. A more up-to-date approach is deep learning. That said, if you need the generated sentences to be meaningful, your best option may be to stay with your current template-based method. You can find an introduction to the basics of n-gram-based NLG here: Generating Random Text with Bigrams
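To give a feel for the n-gram idea, here is a minimal bigram sketch in plain Python. It is not taken from the linked introduction; the toy corpus and the function names `build_bigram_model` and `generate` are made up for illustration.

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model

def generate(model, start, max_words=12, seed=None):
    """Walk the bigram chain from `start`, picking each next word at random."""
    rng = random.Random(seed)
    output = [start]
    while len(output) < max_words:
        candidates = model.get(output[-1])
        if not candidates:  # dead end: the last word never had a successor
            break
        output.append(rng.choice(candidates))
    return " ".join(output)

# Toy corpus (an assumption); a real model needs far more text.
corpus = (
    "visits in the UK reached 10K . "
    "visits in the US were above the expected value . "
    "visits from the email campaign were above the forecast ."
)
model = build_bigram_model(corpus)
print(generate(model, "visits", seed=0))
```

Each word depends only on the one before it, so the output is locally plausible but nothing ties it to the facts in your data, which is why it tends to come out meaningless.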

This tool appears to implement some of the best-known techniques for natural language generation: simplenlg

ayyoob imani
1

Have you tried neural networks, especially LSTM and GRU architectures? These models are among the most recent developments in predicting sequences of words. Generating natural language means generating a sequence of words such that each word makes sense with respect to the input and the earlier words in the sequence. This is analogous to predicting a time series, which is the kind of sequential problem LSTMs were designed for. Hence, they are commonly used to predict a sequence of words given an input sequence, an input word, or any other input that can be embedded in a vector.

Deep learning libraries such as TensorFlow, Keras, and Torch all have sequence-to-sequence implementations that can be used to generate natural language by predicting a sequence of words given an input.

Note that usually these models need a huge amount of training data.

You need to meet two criteria in order to benefit from such models:

  1. You should be able to represent your input as a vector.
  2. You need a relatively large amount of input/target pairs.
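As an illustration of the first criterion, here is a minimal sketch of turning one JSON record into a fixed-length vector. The field names (`metric`, `channel`, `change_pct`, `above_expected`) and the vocabularies are assumptions, since I don't know the actual schema.

```python
import json

# Assumed closed vocabularies for the categorical fields.
METRICS = ["visits", "bounce_rate", "revenue"]
CHANNELS = ["email", "organic", "paid"]

def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

def encode(record):
    """Concatenate one-hot categorical fields with scaled numeric fields."""
    return (
        one_hot(record["metric"], METRICS)
        + one_hot(record["channel"], CHANNELS)
        + [record["change_pct"] / 100.0, 1.0 if record["above_expected"] else 0.0]
    )

# Hypothetical record; the real JSON layout may differ.
record = json.loads(
    '{"metric": "visits", "channel": "email", "change_pct": 20, "above_expected": true}'
)
vector = encode(record)
print(vector)  # [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.2, 1.0]
```

A vector like this would be the input side of each input/target pair; the target would be the sentence you want generated for that record.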
MAZDAK
  • Creating such a large number of input/target pairs would not actually be feasible. I would also need something less of a black box. – Stergios Jun 01 '17 at 18:08