0

I'm usually pretty adept with algorithms, but I've got a pretty abstract question here, which is probably someone's PhD project somewhere, and bordering on NP completeness. But maybe it's a more common problem than I think.

I have a list of 25000 Strings, created using a bunch of drop down lists and text fields. So, to simplify the discussion, lets say this is the, er, unidirectional graph:

{My Cat/My Dog} had {kittens, puppies}.

So, this is like a tree structure whose 4 paths represent 4 possible sentences.

How would one reverse engineer the tree structure from a (possibly incomplete) list of sentences?

i.e.

So that from
My Cat had kittens
My Cat had puppies
My Dog had kittens
, you could still recreate the original syntax tree.

Obviously with 25000 Strings, this will take a while. But is there any software out there that does this? Or, is this a common enough problem that there are known algorithms to do this?

It seems like a regex parser in nature, but I don't know where to begin. I'm dealing with this problem at work, and doing my own analysis of the sentences to parse another 500 or so, every time I find a new pattern. But I reckon if I had the tree structure, I could do it chop chop.

Any ideas? Thanks

djb
  • 1,635
  • 3
  • 26
  • 49
  • If it's incomplete, I don't know enough about AI to figure out how it could rebuild the missing links. From the example you gave, it's not clear that dogs can have puppies -- unless, if they have one same child, they have all same children? What are the rules, how does this behave? – Trevoke Jan 29 '10 at 13:25
  • If it's incomplete, how do you know that *any* sentance that has the word *"had"* in it doesn't point to that particular *"had"* -node, and that the rest of the sentences to prove it aren't just missing? – BlueRaja - Danny Pflughoeft Jan 29 '10 at 14:38
  • If you can't tolerate any false positives, this is clearly impossible. Even if you can, it seems to me you would also have to provide counterexamples for any kind of syntax discovery algorithm. Otherwise it's easy to make a grammar that's too accepting - as for example, in your case, `/(My|Cat|Dog|had|kittens|puppies)*` or for that matter just `/.*/`. – Jason Orendorff Jan 29 '10 at 21:36

4 Answers4

2

Could templatemaker be a step in the right direction for you? Its goal is to infer the templates behind similarly formatted strings, later allowing you to use this template to extract the data from other strings.

Eli Bendersky
  • 263,248
  • 89
  • 350
  • 412
  • thanks, that looks pretty good. I just realised that whatever algorithm diff uses is probably also pretty much what I am looking for. – djb Jan 29 '10 at 13:26
  • @djb: templatemaker uses a kind-of superset of diff - i.e. same longest-common-substring building blocks – Eli Bendersky Jan 29 '10 at 13:33
2

This may come under the heading of learning Finite Automata, in which case it is genuinely a hard problem, at least with the standard assumptions of that field. However, I suspect that your case is easier than most, because you know that the machine is in a single start state at the beginning if each string.

If looking up learning Finite Automata is too depressing, you could just get hold of some code for fitting Hidden Markov Models, let it loose, and hope for the best.

mcdowella
  • 19,301
  • 2
  • 19
  • 25
1

But maybe it's a more common problem than I think.

I believe that this is known as grammar inference or grammar induction.

Rafał Dowgird
  • 43,216
  • 11
  • 77
  • 90
0

Your intuition about regular expression could be right. This is a typical setting for Grammar Induction: induce("find") a set of rules that allow you to generate/recognize a set of strings.

Typically, a tree is a good structure to visualize and manipulate this kind of rules.

One first question is: are your strings so regular? (Answer to this question is not so easy, an operative way could be to try and see by human inspection if the inferred grammar fulfill your goal). If the simplicity of structure of your examples suggests this approach,than you can adopt a regular grammar induction.

For some ready-to-use libraries see:

Community
  • 1
  • 1
Gabrer
  • 465
  • 5
  • 21