0

Perhaps regex is not the best way to parse this, tell me if I it is not. Anyway, here are some examples of what the syntax tree looks like:

(S (CC and))  
(SBARTMP (IN once) (NP otherstuff))   
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))   

Anyway, what I am trying to do is pull the connective out (and, then, once, etc) and its corresponding head (CC,IN,CC), which I already know for each syntax tree so it can act as an anchor, and I also need to retrieve its parent (in the first it is S, second SBARTMP, and third it is S), and its siblings, if there are any (in the first none, in the second left hand side sibling, and third left-hand-side and right-hand-side sibling). Anything higher than the parent is not included

my $pos = "(\\\w|-)*";  
my $sibling = qr{\s*(\\((?:(?>[^()]+)|(?1))*\\))\s*};  
my $connective = "once";  
my $re = qr{(\(\w*\s*$sibling*\s*\\(IN\s$connective\\)\s*$sibling*\s*\))};  

This code works for things like:

my $test1 = "(X (SBAR-TMP (IN once) (S sdf) (S sdf)))";  
my $test2 = "(X (SBAR-TMP (IN once))";  
my $test3 = "(X (SBAR-TMP (IN once) (X as))";  
my $test4 = "(X (SBAR-TMP (X adsf) (IN once))";  

It will throw away the X on top and keep everything else, however, once the siblings have stuff embedded in them then it does not match because the regex does not go deeper.

my $test = "(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))";  

I am not sure how to account for this. I am kind of new to the extended patterns for Perl, just started learning it. To clarify a bit about what the regex is doing: it looks for the connective within two parentheses and the capital-letter/- combo, looks for a complete parent of the same format closing with two parentheses and then should look for any number of siblings that have all their parentheses paired off.

MCH
  • 2,124
  • 1
  • 19
  • 34
  • 2
    Having written more parsers than I care to recall I can offer this observation: You can either parse one character at a time or you can parse incorrectly. Regular expressions may seem like a cheap way to parse at first glance but to get a regex parser to be correct and reliable is almost always more work than doing it the hard way. – sorpigal Dec 23 '10 at 13:23
  • 2
    You don't manipulate things like this with regexes. Even if it's theoretically possible due to some weird extension, the only sane way to deal with this is to re-construct the actual tree in memory and then traverse that. –  Dec 23 '10 at 13:34
  • Ok, that's what I thought, but I wanted to make sure. Anyway, this seems like a good place to start http://stackoverflow.com/questions/3693323/how-do-i-manipulate-parse-trees Any other suggestions? – MCH Dec 23 '10 at 14:09
  • We just had an example for parsing a few hours ago: http://stackoverflow.com/questions/4517721/parsing-of-a-logical-expression-and-converting-it-to-a-tree-in-perl – daxim Dec 23 '10 at 14:58

3 Answers3

1

To only get the nearest 'parent' to your anchor connective you can do it as a recursive parent with a FAIL or do it directly. (for some reason I can't edit my other posts, must be cookies being deleted).

use strict;
use warnings;

my $connective = qr/ \((?:IN|CC)\s(?:once|and|then)\)/x;
my $sibling = qr/
  \s*
  ( 
     (?! $connective )
     \(
        (?:
            (?> (?: [^()]+ ) )
          | (?-1)
        )*
     \)
  )
  \s*
 /x;

my $regex1 = qr/
      \( ( [\w-]+ \s* $sibling* \s* $connective \s* $sibling* ) \) #1
 /x;

my $regex2 = qr/
   ( #1
     \( \s*
        (  #2
           [\w-]+ \s*
           (?>   $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
               | (?1)
           )
        )
        \s*
     \)
   )
 /x;


my $sample = qq/
 (X (SBAR-TMP (IN once) (S sdf) (S sdf)))
 (X (SBAR-TMP (IN once))
 (X (SBAR-TMP (IN once) (X as))
 (X (SBAR-TMP (X adsf) (IN once))
 (X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))
 (S (CC and))  
 (SBARTMP (IN once) (NP otherstuff))   
 (S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
/;

while ($sample =~ /$regex1/xg) {
    print "Found:   $1\n";
}
print '-' x 20, "\n";

while ($sample =~ /$regex2/xg) {
    print "Found:   $2\n";
}

__END__
  • Thanks, works great, though I think I might try something out with Lisp since I may have to do more parsing in the future. – MCH Dec 27 '10 at 16:32
0

Why did you give up on this, you almost had it. Try this:

use strict;
use warnings;

 my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
 my $sibling = qr/
  \s*
  ( 
     (?!$connect)
     \(
        (?:
            (?> (?: [^()]+ ) )
          | (?-1)
        )*
     \)
  )
  \s*
 /x;

 my $regex = qr/
   ( #1
     \(
        \s* [\w-]+ \s*
        (?>   $sibling* \s* $connective \s* $sibling*
            | (?1)
        )
      \s*
     \)
   )
 /x;


my @tests = (
  '(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',  
  '(X (SBAR-TMP (IN once))',
  '(X (SBAR-TMP (IN once) (X as))',
  '(X (SBAR-TMP (X adsf) (IN once))',
);

for my $sample (@tests)
{
    while ($sample =~ /$regex/xg) {
         print "Found:   $1\n";
    }
}

my $another =<<EOS;
(S (CC and))  
(SBARTMP (IN once) (NP otherstuff))   
(S
  (S
    (NP blah
      (VP blah)
    )
    (CC then)
    (NP blah
      (VP blah
        (PP blah)
      )
    )
  )
)
EOS

print "\n---------\n";
    while ($another =~ /$regex/xg) {
         print "\nFound:\n$1\n";
    }

END

  • Looking good, in my $sibling $connect should be $connective, I think. For the first example it finds (X ), which it shouldn't. Later to day I will sit down and read through the regex more clearly. – MCH Dec 24 '10 at 03:14
  • So you want the inner most parent only. Just add a conditional between $connection and last $siblings like this: (?> $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling* In that case you probably don't need the outer recursion, but leave it in incase you change your mind. Will try to add a proper full code in answer 2 if it will format for me. –  Dec 24 '10 at 04:36
0

This should work as well

use strict;
use warnings;

my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
my $sibling = qr/
  (?: \s*
  ( 
     (?!$connective)
     \(
        (?:
            (?> (?: [^()]+ ) )
          | (?-1)
        )*
     \)
  )
  \s* )
 /x;

my $regex = qr/
   ( #1
     \( \s*
        (  #2
           [\w-]+ \s*
           (?>   $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
               | (?1)
           )
        )
        \s*
     \)
   )
 /x;


my @tests = (
  '(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',  
  '(X (SBAR-TMP (IN once))',
  '(X (SBAR-TMP (IN once) (X as))',
  '(X (SBAR-TMP (X adsf) (IN once))',
  '(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))',    
);

for my $sample (@tests)
{
    while ($sample =~ /$regex/xg) {
        print "Found:   $2\n";
    }
}

my $another = "
(S (CC and))  
(SBARTMP (IN once) (NP otherstuff))   
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
";

print "\n---------\n";
while ($another =~ /$regex/xg) {
    print "\nFound:\n$2\n";
}

__END__