Automatically generating a regular expression

Question

I want the following function:

def get_pattern_and_replacement(the_input, output):
    """
    Given the_input and output, return a pattern that matches a more general case of the_input and a template string for generating the desired output.

    >>> get_pattern_and_replacement("You're not being nice to me.", "I want to be treated nicely.")
    ("You're not being (?P<word>\w+) to me.", "I want to be treated {{ word }}ly.")
    >>> get_pattern_and_replacement("You're not meeting my needs.", "I want my needs met.")
    ("You're not meeting my (?P<word>\w+).", "I want my {{ word }} met.")
    """

This is for a program to transform undesired text into desired text.

With help from Stack Overflow users, my function is now:

def flatten(nested_list):
    return [item for sublist in nested_list for item in sublist]

def get_pattern_and_replacement(the_input, output):
    """
    Given the_input and output, return a pattern that matches a more general case of the_input and a template string for generating the desired output.

    >>> get_pattern_and_replacement("You're not being nice to me.", "I want to be treated nicely.")
    ("You're not being (?P<word>\w+) to me.", "I want to be treated {{ word }}ly.")
    >>> get_pattern_and_replacement("You're not meeting my needs.", "I want my needs met.")
    ("You're not meeting my (?P<word>\w+).", "I want my {{ word }} met.")
    """
    # Collect every space-free substring of 3 to 11 characters from each string.
    input_set = set(flatten([[the_input[i: i + j] for i in range(len(the_input) - j + 1) if ' ' not in the_input[i: i + j]] for j in range(3, 12)]))
    output_set = set(flatten([[output[i: i + j] for i in range(len(output) - j + 1) if ' ' not in output[i: i + j]] for j in range(3, 12)]))

    # Generalise the longest substring that the two strings share.
    intersection = sorted(input_set & output_set, key=len, reverse=True)
    pattern = the_input.replace(intersection[0], r'(?P<word>\w+)')
    replacement = output.replace(intersection[0], '{{ word }}')
    return (pattern, replacement)
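
The same idea can be sketched at the word level, which avoids enumerating every substring. This is only a rough illustration, not a general solution: the name `word_level_pattern` and the `min_len` parameter are hypothetical, it assumes the shared part is a whole word of the input that appears somewhere in the output, and it adds `re.escape` so that punctuation such as the final "." is matched literally (the docstring examples above leave it unescaped).

```python
import re


def word_level_pattern(the_input, output, min_len=3):
    """Hypothetical sketch: generalise the longest input word found in output."""
    # Keep input words that are long enough and occur somewhere in the output.
    words = [w for w in re.findall(r"\w+", the_input)
             if len(w) >= min_len and w in output]
    if not words:
        return None
    word = max(words, key=len)  # on ties, the first longest word wins
    # Split around the first occurrence and escape the literal parts.
    before, _, after = the_input.partition(word)
    pattern = re.escape(before) + r"(?P<word>\w+)" + re.escape(after)
    replacement = output.replace(word, "{{ word }}", 1)
    return pattern, replacement


pattern, replacement = word_level_pattern(
    "You're not being nice to me.", "I want to be treated nicely.")
match = re.fullmatch(pattern, "You're not being kind to me.")
if match:
    print(replacement.replace("{{ word }}", match.group("word")))
```

It still cannot tell which shared word is the meaningful one; it simply picks the longest, so inputs where a long function word is shared will produce an unhelpful pattern.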

Please at least include some "undesired" and some "desired" text. — , May 16 '13 at 13:36
I don't know how to tackle the problem of figuring out the difference between the input and output. — Timothy Clemans, May 16 '13 at 13:40
Do you want a program that can understand and interpret natural languages and suggest rephrasings? Or do you imagine a specific set of restrictions on the input and output? — Mikkel, May 16 '13 at 13:42
By what rule did you decide that the second example should become "You're not meeting my */I want my * met" and not "You're not * my needs"/"I want my needs *"? Is the idea to hook this up to something like NLTK and do part-of-speech analysis? — DSM, May 16 '13 at 13:42
What you are trying to do is really difficult: you have to create a grammar of regexes and a solver. Regexes are not obvious. — lucasg, May 16 '13 at 13:42
To figure out the differences (or similarities) between input and output, you first need to define what a "difference" is. For example, the letter "e" occurs many times in both your input and output examples. — Mikkel, May 16 '13 at 13:43
That's the ultimate goal. For now I'm using regular expressions. I want users to be able to put in example input and output and get the regular expression and replacement template. — Timothy Clemans, May 16 '13 at 13:43
@Mikkel: I think he wants the function, given a phrase and its replacement, to construct the regex that performs that transformation. — lucasg, May 16 '13 at 13:44
OK, so if my similarity is 3 letters or more, then what do I do to find that similarity? — Timothy Clemans, May 16 '13 at 13:45
You can iteratively get substrings of 3 characters from the first string and try to find that substring in the other string. See http://docs.python.org/release/2.2.1/lib/string-methods.html and http://stackoverflow.com/questions/663171/is-there-a-way-to-substring-a-string-in-python — Mikkel, May 16 '13 at 13:51
If you have a morphological analyzer for English, you could identify that "nicely" is an adverbial derivation of "nice" and that "need" is the same word in both sentences. Down the line, you will bump into ambiguous situations fairly quickly. Already it is unclear to me by what logic you would hope to not pick out "I"/"me", "be"/"being", and "my" as "common" words from your examples, but maybe some sort of topic analysis could help. — tripleee, May 16 '13 at 15:33

score 2 · Accepted Answer · answered May 16 '13 at 14:10

If you want these kinds of template transformations, you've got to write them yourself. Recognizing the common parts is a matter of common sense, practice, and creativity; no general rule can do it for you. But you'll have to read a tutorial on regular expressions, and it'll probably help you think through the problem.

You should probably check out the source code for Eliza, the famous chatbot that started it all. Here's the source to a Python version. As you'll see, the conversational rules are hand-written.
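
To give a sense of what hand-written rules look like, here is a minimal sketch built from the two examples in the question. It is not taken from any Eliza implementation: the `RULES` table and the `rephrase` function are hypothetical names, and it assumes Python's `str.format` placeholders (`{word}`) rather than the `{{ word }}` template syntax used in the question.

```python
import re

# Hypothetical hand-written rules: each pairs a regex with an output template.
RULES = [
    (re.compile(r"You're not being (?P<word>\w+) to me\."),
     "I want to be treated {word}ly."),
    (re.compile(r"You're not meeting my (?P<word>\w+)\."),
     "I want my {word} met."),
]


def rephrase(text):
    """Apply the first rule whose pattern matches the whole text."""
    for pattern, template in RULES:
        match = pattern.fullmatch(text)
        if match:
            # Fill the template with the named groups captured by the regex.
            return template.format(**match.groupdict())
    return text  # no rule applies: leave the text unchanged


print(rephrase("You're not being kind to me."))
print(rephrase("You're not meeting my expectations."))
```

Each new phrasing needs its own rule, which is the point: the person writing the rules decides what counts as the variable part.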

If you're hoping for an algorithm that will generate templates like the examples you included: That's a very, very hard problem with no single reasonable solution. Forget it. Go read a regexp tutorial instead.
