I'm trying to find lines that don't end with </url>

Here's the regexp I've got so far.

\<\/url\>

I can find the </url>, but I'm looking for lines that are missing that at the end.

My data looks like this:

<url><loc>blah1 blah1 blah1</loc></url>    
<url><loc>blah2 blah2 blah2    
<url><loc>blah3 blah3 blah3</loc></url>

In this example I'm trying to find the line that looks like blah2

Thanks in advance!

Jacob Eggers
s15199d
  • Have a look at http://stackoverflow.com/questions/1153856/string-negation-using-regular-expressions – Neel Aug 24 '11 at 17:24
  • Is this supposed to be html? Because in that case this will be impossible with regexes... – sg3s Aug 24 '11 at 17:25
  • You may find this page interesting: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – murgatroid99 Aug 24 '11 at 17:28
  • That might be possible to do with a simple text function. – Dor Aug 24 '11 at 17:54
  • I ended up writing something in .NET to parse the file and add the missing closing tags. Never got the regexp to work in the DreamWeaver find/replace. – s15199d Aug 25 '11 at 19:05
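For readers who want the non-regex route mentioned in the last two comments, here is a minimal sketch in Python (not the asker's .NET code, which was never posted). The function name `close_url_lines` is hypothetical, and the sketch assumes every entry sits on its own line in the form `<url><loc>…</loc></url>`, as in the sample data.

```python
def close_url_lines(text):
    """Append missing </loc> and </url> to lines that start with <url>.

    Assumption: one <url> entry per line, as in the question's sample.
    Trailing whitespace is stripped from every line as a side effect.
    """
    fixed = []
    for line in text.splitlines():
        stripped = line.rstrip()
        if stripped.startswith("<url>") and not stripped.endswith("</url>"):
            # Close <loc> first if it is also missing, then close <url>.
            if not stripped.endswith("</loc>"):
                stripped += "</loc>"
            stripped += "</url>"
        fixed.append(stripped)
    return "\n".join(fixed)


sample = (
    "<url><loc>blah1 blah1 blah1</loc></url>    \n"
    "<url><loc>blah2 blah2 blah2    \n"
    "<url><loc>blah3 blah3 blah3</loc></url>"
)
repaired = close_url_lines(sample)
print(repaired)
```

Lines that already end with `</url>` pass through unchanged apart from the stripped trailing spaces.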

2 Answers

This regex should work for you as long as there is no whitespace at the end of the line:

/^.*?(?<!<\/url>)$/
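As a quick check outside any editor, here is a minimal Python sketch of the lookbehind pattern run against the sample data from the question. Python's `re` module and its `re.MULTILINE` flag are chosen only for illustration and are not something the question uses. The trailing spaces in the sample have been removed, in line with the caveat above.

```python
import re

# Sample data from the question, with trailing whitespace removed.
DATA = (
    "<url><loc>blah1 blah1 blah1</loc></url>\n"
    "<url><loc>blah2 blah2 blah2\n"
    "<url><loc>blah3 blah3 blah3</loc></url>"
)

# Negative lookbehind: at the end of the line, the six characters just
# before it must not be "</url>". re.MULTILINE makes ^ and $ match at
# every line boundary instead of only at the ends of the whole string.
NOT_CLOSED = re.compile(r"^.*?(?<!</url>)$", re.MULTILINE)

unclosed_lines = [m.group(0) for m in NOT_CLOSED.finditer(DATA)]
print(unclosed_lines)
```

Only the `blah2` line is reported, because it is the only one whose last six characters are not `</url>`.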

Edit: Here's another, more complex one to work around not having lookbehind:

^.*?([^>]|[^l].|[^r].{2}|[^u].{3}|[^/].{4}|[^<].{5})$
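To sanity-check the lookbehind-free version, the sketch below runs it next to the lookbehind pattern in Python (an illustration language, not the asker's tool), one line at a time. One behavior worth knowing: every alternative needs a particular character between one and six places from the end, so a short line such as `url>` satisfies none of them and is skipped even though it does not end with `</url>`. Empty lines are skipped for the same reason.

```python
import re

# Both patterns are applied to a single line at a time, so no flags
# are needed and a negated class can never run across a line break.
WITH_LOOKBEHIND = re.compile(r"^.*?(?<!</url>)$")
WITHOUT_LOOKBEHIND = re.compile(
    r"^.*?([^>]|[^l].|[^r].{2}|[^u].{3}|[^/].{4}|[^<].{5})$"
)

lines = [
    "<url><loc>blah1 blah1 blah1</loc></url>",
    "<url><loc>blah2 blah2 blah2",
    "<url><loc>blah3 blah3 blah3</loc></url>",
    "url>",  # shorter than "</url>", so it cannot end with it
]

# Map each line to (matched by lookbehind, matched by alternation).
results = {}
for line in lines:
    results[line] = (
        WITH_LOOKBEHIND.match(line) is not None,
        WITHOUT_LOOKBEHIND.match(line) is not None,
    )
    print(f"{line!r}: {results[line]}")
```

The two patterns agree on the three sample lines and differ only on the short `url>` line.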
Jacob Eggers
  • +1 You posted the exact same thing mere seconds before I did. :-) – Wiseguy Aug 24 '11 at 17:39
  • Here's info on [negative lookbehinds](http://www.regular-expressions.info/lookaround.html), and here's [an example](http://refiddle.com/1as). – Wiseguy Aug 24 '11 at 17:41
  • @Wiseguy nice. And +1 for deleting it and keeping the question clean. – Jacob Eggers Aug 24 '11 at 17:41
  • I'm using this regexp in DreamWeaver to do a find and replace. DW threw this error when trying to use this regexp... invalid quantifier ?<!<\/url>)$/ – s15199d Aug 24 '11 at 19:42
  • @s15 Ah, I didn't realize you were trying to do this in dreamweaver. Sorry, I'm not familiar with their acceptable regex syntax. – Jacob Eggers Aug 24 '11 at 19:50
  • Dreamweaver doesn't support the negative look behind (it uses JavaScript RegExp), nor does it support multiline regexps, so start of line ( ^ ) and end of line ($) characters don't do anything. You may as well copy and paste your code into the negative look behinds example posted in another comment and run the code and it'll highlight what you're looking for. Also note that Dreamweaver's RegExp searches do not have the leading and trailing / as those are assumed when you check the use regular expression box – Danilo Celic Aug 24 '11 at 23:37
  • @Danilo: As I understand it, DW regexes are *always* multiline. That is, `^` always matches the beginning of a line and `$` always matches the end of a line. And of course, the (beginning/end) of the input is also the (beginning/end) of the (first/last) line. – Alan Moore Aug 25 '11 at 01:36
  • @Alan: It is my understanding that in the context of a RegExp, multiline would treat each line within the content as separate entries that are matchable with ^ and $, not that the entire content is multiple lines itself. See the multiline flag here: https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/regexp – Danilo Celic Aug 26 '11 at 23:26
  • For example, if Dreamweaver supported the multiline flag in the Find dialog, then to find every line in a file that starts with < you could use the following regexp: ^<[\W\w]*$ However, in Dreamweaver, if you run that search on an HTML file that starts with <, it will match the entire document, not individual lines within the document. – Danilo Celic Aug 26 '11 at 23:33
  • @Danilo: `^[\W\w]*$` will match the whole file because the `*` is greedy; to match individual lines you would use `^[\W\w]*?$`. I don't know Dreamweaver, but typically in editors with regex find/replace, multiline mode is the default. In EditPad Pro, you can't even turn it off; in the rare instances where you need to match *only* the very start or end, you can use `\A` and `\z`. It may be different in DW since JavaScript doesn't support those anchors, but I would still expect multiline mode to be the default. – Alan Moore Aug 27 '11 at 03:15
  • Fair enough point about the RegExp I chose to use as an example, but my point remains: Dreamweaver will not match individual lines when using ^ and $ (regardless of the actual expression you're using). Those characters are only useful for matching the **entire content** that is being searched through (which could be a single line of code if you are searching through selected text), but will never be individual lines within a multiple line block of code. – Danilo Celic Aug 27 '11 at 19:34
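The disagreement in these comments comes down to whether `^` and `$` anchor at every line or only at the ends of the whole text being searched. Which of the two Dreamweaver does is not settled here. The sketch below only illustrates the two modes, using Python's `re` module, where `re.MULTILINE` plays the role of JavaScript's `m` flag. The three-line `text` value is made up for the demonstration.

```python
import re

# Hypothetical three-line input: two lines start with "<", one does not.
text = "<a>\nplain\n<b>"

# Lazy version of the "line starting with <" pattern from the comments.
pattern = r"^<[\W\w]*?$"

# Without the flag, ^ matches only at the start of the text and $ only
# at its end, so the single match spans all three lines.
single = re.findall(pattern, text)

# With re.MULTILINE, ^ and $ also match at each line break, so every
# line that starts with "<" is matched on its own.
multi = re.findall(pattern, text, re.MULTILINE)

print(single)
print(multi)
```

If an editor behaves like the first call, line-anchored patterns from this thread will match either the whole buffer or nothing, which fits the behavior Danilo describes.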

Try this:

^(?:(?!</url>$).)*$

This is effectively ^.*$, but each time the . is about to match, the lookahead tries to match </url> followed by the end of the line. If it succeeds, the match fails.

I notice, however, that the first two lines in your example end with several space characters. If that's true of your real data, you'll need to allow for it in the lookahead:

^(?:(?!</url>[ \t]*$).)*$
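A small Python sketch (Python is an illustration choice, not the asker's environment) shows why the whitespace allowance matters. The data is the sample from the question with the trailing spaces kept on the first two lines.

```python
import re

# Sample data as posted, keeping the trailing spaces on lines 1 and 2.
DATA = (
    "<url><loc>blah1 blah1 blah1</loc></url>    \n"
    "<url><loc>blah2 blah2 blah2    \n"
    "<url><loc>blah3 blah3 blah3</loc></url>"
)

# STRICT rejects a line only if "</url>" is the very last thing on it.
STRICT = re.compile(r"^(?:(?!</url>$).)*$", re.MULTILINE)
# LENIENT also rejects it when only spaces or tabs follow "</url>".
LENIENT = re.compile(r"^(?:(?!</url>[ \t]*$).)*$", re.MULTILINE)

strict_hits = [m.group(0) for m in STRICT.finditer(DATA)]
lenient_hits = [m.group(0) for m in LENIENT.finditer(DATA)]

print(len(strict_hits), "line(s) matched by the strict pattern")
print(len(lenient_hits), "line(s) matched by the lenient pattern")
```

The strict pattern also reports the first line, because its `</url>` is followed by spaces rather than the end of the line. The lenient pattern reports only the `blah2` line.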

This will also match a completely empty line. If you want to require at least one character, you can change the * to +:

^(?:(?!</url>$).)+$

...or you might want to match only lines that start with <url>:

^<url>(?:(?!</url>$).)*$
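To compare the variants side by side, here is a rough Python sketch. The input lines are invented for the comparison, and the helper name `hits` is hypothetical. The `<url>`-anchored pattern is written with a `*` after the group, on the assumption that it is meant to match whole lines like the other variants.

```python
import re

# Made-up input covering the cases the variants treat differently.
DATA = "\n".join([
    "<url><loc>blah1</loc></url>",  # properly closed
    "",                             # empty line
    "<url><loc>blah2",              # missing closing tags
    "<!-- comment -->",             # unrelated line
])

FLAGS = re.MULTILINE
ANY_LINE = re.compile(r"^(?:(?!</url>$).)*$", FLAGS)
NON_EMPTY = re.compile(r"^(?:(?!</url>$).)+$", FLAGS)
URL_ONLY = re.compile(r"^<url>(?:(?!</url>$).)*$", FLAGS)


def hits(pattern):
    """Return the text of every match of `pattern` in DATA."""
    return [m.group(0) for m in pattern.finditer(DATA)]


any_hits = hits(ANY_LINE)
non_empty_hits = hits(NON_EMPTY)
url_hits = hits(URL_ONLY)

print(any_hits)
print(non_empty_hits)
print(url_hits)
```

The `*` version includes the empty line, the `+` version drops it, and the `<url>`-anchored version keeps only the unclosed `<url>` entry.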
Alan Moore
  • There is no whitespace at the end of the lines of the data I'm working with. Unfortunately none of these regexp worked. – s15199d Aug 25 '11 at 12:33