1

Part of a homework problem I am working on is as follows:

"Build a regular expression that captures all nonempty sequences of letters other than file, for or from. For notational convenience, you may assume the existence of a not operator that takes a set of letters as argument and matches any other letter."

The answer I came up with:

not(f[ile|or|rom])

Maybe I am paranoid, but that seems too easy? I would appreciate any comment, I know this cannot be correct.

Destruktor
  • 463
  • 1
  • 4
  • 19
  • 2
    What regex flavour do you use? – h2ooooooo Feb 03 '14 at 18:36
  • @h2ooooooo good question, the question is from the text "Programming Language Pragmatics" (Micheal Scott), and there is not an emphasis on any one flavor. As I am not very familiar with using regular expressions, I don't have a specific preference. The question is really from more of a theoretical POV. – Destruktor Feb 03 '14 at 18:42
  • 1
    perhaps [this](http://stackoverflow.com/questions/406230/regular-expression-to-match-string-not-containing-a-word) will help you. ie. `^((?!f(ile|or|rom)).)*$` – tenub Feb 03 '14 at 18:42
  • The homework question is a very badly phrased, incomplete description. Who is your teacher? –  Feb 03 '14 at 18:44
  • 2
    @sln As far as I can conclude, you can blame [Michael L. Scott](http://www.amazon.com/Programming-Language-Pragmatics-Third-Edition/dp/0123745144) for this horrible question. – h2ooooooo Feb 03 '14 at 18:45

4 Answers4

5

[ile|or|rom] will match i, l, e, |, o, r or m once as characters inside [ and ] are character groups. [as]{3} would match aaa, sss, asa, sas etc.

(ile|or|rom) will match ile, or or rom.

Perhaps you're looking for

not( f(ile|or|rom) )
h2ooooooo
  • 39,111
  • 8
  • 68
  • 102
2

This one uses no special (negative) look-a-head syntax. It does branching to exclude the invalid states. I created it only for file, since it is a bit long.

^(f(i(l($|e.+$|[^e].*)|$|[^l].*$)|$|[^i].*)|[^f].*$)

This is the (deterministic) automaton for the regexp (made using Regexper):

This uses look-a-head:

^(?!f(ile|or|rom)$)[a-z]+$
Stefan Ollinger
  • 1,577
  • 9
  • 16
  • 1
    +1 - Good effort. I started off trying to do it like your top regex. Realized its nightmarish, kind of like keyboard event validation. Went with a lookbehind approach. –  Feb 03 '14 at 19:54
  • Yes, the problem is nightmarish when defined like this, I think that was the point of the exercise - to show how difficult it would be to define regular expressions to find keywords in programming languages. thanks again! – Destruktor Feb 05 '14 at 15:58
1

It is difficult to answer a question to such a hypothetical regex flavour. 2 things:

  1. characters inside of square brackets are defining a character class. if you want to match a "f" followed by "ile" or "or" or "rom", use a normal group

    f(ile|or|rom)
    
  2. Assuming not(f(ile|or|rom)) is matching any character, that is not part of any of those words, then you need a quantifier to match repeated characters.

    not(f(ile|or|rom))+
    

    + repeats the item before one or more times

Bonus:

A real world solution using a negative lookahead assertion would be

\b((?!f(ile|or|rom)\b)\w)+\b

See it here on Regexr

stema
  • 90,351
  • 20
  • 107
  • 135
1

A backwards implementation to get up to f(il, o, ro)

 #  (?s)(?:.(?<!file)(?<!for)(?<!from))+

edit
Using a lookbehind is forever pathological.
So to save face, below are 2 ways I know to do it in a fairly simple way.

The first is to use split, which is straight forward.

 (?<=fil)(?=e)|(?<=fo)(?=r)|(?<=fro)(?=m)  

The second way is fairly simple. Find up until beginning of file|for|from
then match any remaining fil|fo|fro.
This will match every character, something a lookbehind won't do.

Example using both split and the straight regex are in the test case.

Regex explained

 #  (?s)(?:(?!file|for|from).())*(?:(?:fil|fo|fro)())?(?=\1|\2)

 (?s)                         # Dot-All
 (?:                          # Optional group, do many times
      (?! file | for | from )     # Lookahead, not 'file', 'for', 'from'
      .                           # Match this character
      ( )                         # Set a Group 1 flag (empty, but defined)
 )*
 (?:                          # Optional group, do once
      (?: fil | fo | fro )        # 'fil'(e), 'fo'(r), 'fro'(m)
      ( )                         # Set a Group 2 flag (empty, but defined)
 )?
 (?= \1 | \2 )                # See if we matched at least 1 character
                              # (this could be done with a conditional,
                              #  but not all engines have it)

Perl test case.

 $/ = undef;
 $str = <DATA>;

 # Using Split()

 my @ary = split(/(?<=fil)(?=e)|(?<=fo)(?=r)|(?<=fro)(?=m)/, $str);

 for (my $i = 0; $i < @ary; $i++)
 {
    print $ary[$i],"\n";
 }
 print "----------\n";

 # Using just Regex

 while ($str =~ /(?s)(?:(?!file|for|from).())*(?:(?:fil|fo|fro)())?(?=\1|\2)/g ) 
 {
    print $&, "\n";
 }
  __DATA__
 this file is a frozen filled football from Steve, for trackingfromforfile

Output >>

 this fil
 e is a frozen filled football fro
 m Steve, fo
 r trackingfro
 mfo
 rfil
 e
 ----------
 this fil
 e is a frozen filled football fro
 m Steve, fo
 r trackingfro
 mfo
 rfil
 e