I am trying to run the following search (with . made to match newlines either by adding the /s flag in perl or replacing it with \_. in vim):

/<output_channels>.*(?=Story).*?<\/output_channels>/

However, the ? isn't turning off greediness as it normally does. Can anyone explain why? For example, it matches the entire contents of the following file rather than just the first element:

<output_channels>
  <output_channel>RSS</output_channel>
  <output_channel>Story</output_channel> 
</output_channels>

<output_channels>
  <output_channel>RSS</output_channel>
</output_channels>

Sorry if I'm missing something obvious.

tog22
  • So, are you using Perl regex or vim's regex search/replace? – BoltClock Apr 15 '11 at 09:54
  • The RE you give uses a couple of elements that don't work in vim. Not sure if you realize this or not. Check [`:help perl-patterns`](http://vimdoc.sourceforge.net/htmldoc/pattern.html#perl-patterns) for a list of differences. What are you using to do the search? – intuited Apr 15 '11 at 09:58
  • @BoltClock Both/either. Ultimately I'll use perl but I find it quicker to test regexes in vim. – tog22 Apr 15 '11 at 11:21

2 Answers

The first .* in your regex is still greedy. You only added ? after the second one.
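
To see what the greedy first .* does, here is a minimal sketch in Python. Python is only a stand-in here: for this pattern its `re` module treats `.*`, `.*?` and `(?=...)` the same way Perl does, and `re.DOTALL` plays the role of `/s`. The sample text is made up for illustration and, unlike the file in the question, has Story in both blocks.

```python
import re

# Hypothetical sample: both blocks contain "Story".
text = (
    "<output_channels>\n"
    "  <output_channel>Story</output_channel>\n"
    "</output_channels>\n"
    "\n"
    "<output_channels>\n"
    "  <output_channel>Story</output_channel>\n"
    "</output_channels>\n"
)

# Pattern from the question: the first .* is greedy, only the second is lazy.
greedy = re.compile(r"<output_channels>.*(?=Story).*?</output_channels>", re.DOTALL)

# Same pattern with the first quantifier made lazy as well.
lazy = re.compile(r"<output_channels>.*?(?=Story).*?</output_channels>", re.DOTALL)

greedy_match = greedy.search(text).group(0)
lazy_match = lazy.search(text).group(0)

# Count how many opening tags each match swallowed.
print(greedy_match.count("<output_channels>"))
print(lazy_match.count("<output_channels>"))
```

The greedy version runs forward to the last Story in the input before the lazy part takes over, so its match spans both blocks. The lazy version stops at the first Story and stays inside the first block.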

Avi

I put your sample text into a vim buffer, and then executed the command

:%!perl -e '$text = join("", <STDIN>); $text =~ /<output_channels>.*(?=Story).*?<\/output_channels>/s; print $&;'

The result is just the first block of XML. I think this is what you want?

Note that I escaped the / within the regex. Other than this, it is the same one given in your question.
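
For anyone without perl to hand, the same check can be sketched in Python. This is an assumption-laden stand-in rather than the command above: `re.DOTALL` replaces `/s`, and the `sample` string is the file from the question typed in by hand.

```python
import re

# The file from the question, reproduced as a string.
sample = """<output_channels>
  <output_channel>RSS</output_channel>
  <output_channel>Story</output_channel>
</output_channels>

<output_channels>
  <output_channel>RSS</output_channel>
</output_channels>
"""

# Same regex as in the question; "/" needs no escaping in a Python string.
pattern = re.compile(r"<output_channels>.*(?=Story).*?</output_channels>", re.DOTALL)

match = pattern.search(sample).group(0)
print(match)
```

The greedy .* backtracks to the last position that is followed by Story. This sample contains Story only once, in the first block, so the lazy .*? then stops at the first closing tag and the match is the first block alone.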

Also note that the equivalent vim RE would be (tested, works):

<output_channels>\_.*\(Story\)\@=\_.\{-}<\/output_channels>

See :help perl-patterns for a rundown of the differences between perl and vim REs.

Further note that parsing hierarchical markup with regexps has been known to reawaken ancient demons.

intuited
  • Thanks. For what it's worth, your vim RE doesn't work - it'd be nice to know one I could use while testing in vim, but the perl RE is all I really need. – tog22 Apr 15 '11 at 12:41
  • ...though can you explain why the following doesn't work as intended (to capture just the second element in my file) when I switch to a negative lookahead? I have a feeling it's to do with the greediness of the first .* but when I switch this to .*? I get an operator. Is there a way to capture elements not containing 'Story', or am I better off using a tool other than regexps? /\_.*\(story\)\@<!\_.\{-}<\/output_channels>/ – tog22 Apr 15 '11 at 15:13
  • @tog22: I just tested the vim RE and found that it works okay with both [`/`](http://vimdoc.sourceforge.net/htmldoc/pattern.html#/) and [`matchstr()`](http://vimdoc.sourceforge.net/htmldoc/eval.html#matchstr()). Note that in vim you don't need to (and mustn't) surround an RE with `/` characters; I just left those in to make it similar to the perl-ish version. I've taken them out. – intuited Apr 15 '11 at 17:13
  • @tog22: If you want to match a string of text that does not contain a particular submatch, you have to use something like `((?!submatch).)*`, i.e. you specify that each position does not match the thing you're avoiding. – intuited Apr 15 '11 at 17:18
  • @intuited Really appreciate your help, but I'm afraid you'll have to talk me through it more slowly. In what complete expression would I use `((?!submatch).)*` to find an `<output_channels>` element not containing 'submatch'? I tried `.*?((?!Story).)*?<\/output_channels>` but this matches both blocks... – tog22 Apr 18 '11 at 10:17
  • @tog22: It would be much easier to use an XML parser to do this sort of thing. I'm not sure what's available in perl, but I find that Python's `lxml` module is very good. It mostly consists of bindings to a C library, so perl may have the same. – intuited Apr 18 '11 at 17:41