In a previous question, I asked about matching multiple patterns. Now my question is:

I have a few matching patterns:

$text =~ m#finance(.*?)end#s; (1)

$text =~ m#<class>(.*?)</class>#s; (2)

$text =~ m#/data(.*?)<end>#s; (3)

$text =~ m#/begin(.*?)</begin>#s; (4)

I want to match (1), (2) and (3) first. However, after matching (1) or (2), if (4) appears before another (1) or (2), then (3) should not be matched, only (4). In other words, the appearance of (4) excludes (3) from being matched, but if no (4) appears, (3) is matched. Is there a good way to do this?

Many thanks.

Qiang Li
  • It seems you are trying to parse a non-regular language with regular expressions. What kind of data is that? – matthias krull Mar 18 '11 at 01:28
  • @mugen: why did you say "parse a non-regular language with regular expressions"? Can you please elaborate a bit? thanks. – Qiang Li Mar 18 '11 at 02:19
  • I mean that your data looks like XML, HTML or some similar language. Those are Chomsky type 2 languages, described by context-free grammars. A regular grammar is type 3 and generates a regular language. Regular expressions cannot be used to parse type 2 languages in a sane way. There is a famous question about that; I will look for the link later. – matthias krull Mar 18 '11 at 12:41
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – matthias krull Mar 18 '11 at 12:50

1 Answer

There's one unclear point in your specification: does the suppression of (3) last only from a match of (4) until the next match of (1) or (2), or is it wider in scope?

In any case, this is best handled with a state machine.

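# State bits: 1 = saw (1), 2 = saw (2), 4 = (4) seen after (1)/(2), so (3) is suppressed
my $state = 0;
while ($text =~ m#(?: finance (.*?) end
                  |   <class> (.*?) </class>
                  |   /data   (.*?) <end>
                  |   /begin  (.*?) </begin>
                  )
                 #sgx) {
  if (defined $1) {
    $state = ($state & ~4) | 1; # a new (1) lifts the suppression of (3)
    print $1;
  }
  elsif (defined $2) {
    $state = ($state & ~4) | 2; # a new (2) lifts the suppression of (3)
    print $2;
  }
  elsif (defined $3) {
    # Test the suppression bit inside the branch; otherwise a suppressed (3)
    # would fall through to the die below.
    print $3 unless $state & 4;
  }
  elsif (defined $4) {
    print $4;
    if ($state & 3) { # 1 OR 2
      $state = 4; # set 4, clear 1 and 2
    }
  }
  else {
    die 'Someone modified me without extending the state machine!';
  }
}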

(This is syntax checked, but not tested; it's complex enough that a sample data set would be useful.)
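
Since a sample data set would help, here is a minimal sketch of the same idea run against made-up input. The helper name `extract_matches`, the `$sample` string and the decision to collect (and trim) the captures instead of printing them are all assumptions for illustration, and it takes the narrow reading of the rule: (3) is skipped only from a (4) that follows a (1)/(2) until the next (1)/(2).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captures that survive the suppression rule.
sub extract_matches {
    my ($text) = @_;
    my @found;
    my $seen_1_or_2 = 0;    # a (1)/(2) match has been seen and not yet "used" by a (4)
    my $suppress_3  = 0;    # a (4) followed a (1)/(2): skip (3) until the next (1)/(2)

    while ($text =~ m{ finance (.*?) end
                     | <class> (.*?) </class>
                     | /data   (.*?) <end>
                     | /begin  (.*?) </begin>
                     }sgx) {
        # Only the group of the alternative that matched is defined.
        my ($fin, $class, $data, $begin) = ($1, $2, $3, $4);

        if (defined $fin or defined $class) {
            push @found, defined $fin ? $fin : $class;
            ($seen_1_or_2, $suppress_3) = (1, 0);
        }
        elsif (defined $data) {
            push @found, $data unless $suppress_3;
        }
        elsif (defined $begin) {
            push @found, $begin;
            ($seen_1_or_2, $suppress_3) = (0, 1) if $seen_1_or_2;
        }
    }

    s/^\s+|\s+$//g for @found;    # trim the whitespace around each capture
    return @found;
}

# Made-up sample: E sits after a (4) that followed a (2), so it is skipped.
my $sample = 'finance A end /data B <end> <class>C</class> '
           . '/begin D </begin> /data E <end> finance F end /data G <end>';

print join(',', extract_matches($sample)), "\n";    # prints A,B,C,D,F,G
```

If the wider reading is what you want (for example, a (4) anywhere suppresses every later (3)), only the two flag assignments need to change.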

geekosaur
  • @geekosaur: how did you do syntax checking with perl? thanks a lot. – Qiang Li Mar 18 '11 at 01:55
  • @Qiang Li: `perl -c script.pl`. `use`s will be executed (necessarily, since they change how other things are parsed); the main script will simply be checked for valid syntax. (Which, for Perl, is often [less than half of the story.](http://calculist.blogspot.com/2006/02/nancy-typing.html "The Little Calculist: Nancy typing")) (Perl 5.12.3, but should work back to at least 5.005_03.) – geekosaur Mar 18 '11 at 01:58
  • @geekosaur: yes, I know this `-c` flag. I thought that you used some IDE. I have been trying to find a Perl IDE, but could not. The thing I want most is: given existing Perl code, how do I reformat or beautify it? – Qiang Li Mar 18 '11 at 02:00
  • No, no IDE; I actually composed all of those inside the textarea, with nothing but standard Safari controls. But hard-to-read code does nobody any favors — least of all me, when I'm working in such a minimal editing environment. You might look at [`perltidy`](http://perltidy.sourceforge.net/) for general reformatting. – geekosaur Mar 18 '11 at 02:05
  • Also, Perl is idiosyncratic enough (see previous comment about `use` changing the way things afterward are parsed, for just one example; also, note what SO's code highlighting did when it saw `#` used as pattern delimiters) that a truly useful IDE is probably a lost cause. Even Emacs's `cperl-mode` still gets confused regularly in my experience. – geekosaur Mar 18 '11 at 02:07
  • Qiang Li: try Padre IDE, it has support for perlcritic and perltidy as plugins. – Alexandr Ciornii Mar 18 '11 at 11:04