Yet another 'negation matching'/'match everything except' issue in JavaScript.

So here's what I want to do:

I have a huge text file and I want to remove everything from the file except the username/password lines. The following is a sample part from the text:

<property name="password">QWERTY</property>
....lots of similar tags......
<property name="username">Hello</property>
<property name="passive">1</property>
<property name="password">Test Password</property>
<property name="scheme">smb</property>
<property name="timeout">10000</property>
<property name="username">RANDOM USER</property>
....lots of similar tags......
<property name="username">Sid</property>

I want to remove each and every line which is not the password or the username.

I tried the following replace function to at least start off with the password but it didn't seem to work:

incomingString = incomingString.replace(/[\W\w]*?(?=<property name="password">[\W\w]*?</property).*?/g,"");

Looking back, I can see there are far too many issues with that regex, so I would like a working regex that removes all the other lines from the text above and leaves me with:

<property name="password">QWERTY</property>
<property name="username">Hello</property>
<property name="password">Test Password</property>
<property name="username">RANDOM USER</property>
<property name="username">Sid</property>

PS: It is important that their order in the document should be maintained

I went through a few questions on SO about this unending issue in JavaScript regex (this and lookbehinds), but the answers were very specific to each particular case.

Any help would be appreciated.

Thanks.

Sid
  • Parsing XML with regex is almost as bad as [parsing HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). You might look at http://stackoverflow.com/questions/17604071/parse-xml-using-javascript – Matt Burland Sep 26 '14 at 15:16
  • Also, rather than "replace everything that doesn't match", might it not be easier to "extract the part that does match"? – Matt Burland Sep 26 '14 at 15:19
  • @MattBurland: Thanks for the link. And I was almost sure I'd be told parsing XML with Regex is a terrible idea but I wasn't sure how else I'd have been able to do it. I'll use the link for that but I'd still like to know how JS regex deals with negations – Sid Sep 26 '14 at 15:20
  • @MattBurland: About extracting all that does match, that won't help me retain the order though, correct? I need to make sure if a username is say XYZ, his password must be ABC. That's actually the main reason behind this question. – Sid Sep 26 '14 at 15:22
  • Why wouldn't it? Any regex is going to return the matches in the order they were encountered in the original string. – Matt Burland Sep 26 '14 at 15:23
  • @MattBurland: Oh wait, bummer, I was going to write two separate regexes for the username and another one for the password. I get what you're saying. I'll give it a try with .match and get back to you. – Sid Sep 26 '14 at 15:25
  • I'd suggest *not* using regular expressions (though at least XML is semi-regular), which will come as no kind of surprise, I know; but given that, on first attempt, my non-regex best effort seems to be atrociously verbose ([JS Fiddle demo](http://jsfiddle.net/davidThomas/nLd7m1gv/)), honestly, why not? RegEx looks so much nicer in [anubhava's answer](http://stackoverflow.com/a/26062824/82548)... – David Thomas Sep 26 '14 at 15:39

2 Answers

You can use this regex for String#match call:

/<property[^>]*name="(username|password)[^>]*>[^<]*<\/property>/gi

RegEx Demo
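
As a minimal sketch of how that String#match call might look in practice: the sample text below is modelled on the question's snippet, and the names `incomingString`, `credentialTag` and `kept` are illustrative assumptions rather than anything prescribed here. Note that the `/` in the closing tag is escaped so the regex literal parses.

```javascript
// Sample input modelled on the question's snippet (names are illustrative).
const incomingString = [
  '<property name="password">QWERTY</property>',
  '<property name="username">Hello</property>',
  '<property name="passive">1</property>',
  '<property name="password">Test Password</property>',
  '<property name="scheme">smb</property>',
  '<property name="timeout">10000</property>',
  '<property name="username">RANDOM USER</property>',
].join("\n");

// The regex from this answer, with the "/" in the closing tag escaped.
const credentialTag = /<property[^>]*name="(username|password)[^>]*>[^<]*<\/property>/gi;

// With the g flag, match() returns every full match in document order,
// or null when nothing matches (hence the "|| []" fallback).
const kept = incomingString.match(credentialTag) || [];

console.log(kept.join("\n"));
```

Because the matches come back in the order they occur in the input, each username stays next to the password that accompanied it in the file.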

anubhava
  • Thanks again. I remember you had helped me with negation matching last time around. I'm sorry I didn't mention that the XML could all be on one line too; there are times I've seen that happen in the files. Secondly, this gives me the matches. Is there any way to get at the non-matched part, for use with the replace function? – Sid Sep 26 '14 at 15:17
  • Oh ok, is XML always using `QWERTY` type tags? – anubhava Sep 26 '14 at 15:27
  • Yes, every tag I need to extract will be either that with "password" or that with "username" – Sid Sep 26 '14 at 15:28
  • Still has the issue where two properties on the same line would still be matched. Like Hello1 – Sid Sep 26 '14 at 15:34
  • If we ignore the case where the XML is all on one line and use the regex you proposed earlier, how would I negate it? – Sid Sep 26 '14 at 15:35
  • @Sid: Those `.*` probably ought to be non-greedy `.*?` to solve the problem of it matching the last `</property>` rather than the next `</property>` on the line. – Matt Burland Sep 26 '14 at 15:44
  • @anubhava: You need to fix the case where a line starts with a property the OP doesn't want as well as the case where it is followed by a property that isn't wanted. Example: `1Hello` – Matt Burland Sep 26 '14 at 15:57

Although I still think you are better off using an XML parser here, this should fix the one line problem:

<property[^>]*name="(username|password)".*?</property>

http://regex101.com/r/oM7aD2/1

You match the literal <property followed by any number of characters that aren't a literal > (this prevents a match from starting at a tag that isn't username or password and running on into a later one). The rest is the same as @anubhava's, although I took the liberty of adding the second literal " in case you encounter other properties whose names start with username or password, e.g. password_expires.
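
A rough sketch of how this regex could be used to rebuild the filtered text when everything sits on one line. The input string, its `never` value, and the names `singleLine`, `wanted` and `rebuilt` are made up for illustration; only the pattern itself comes from this answer, written here as a JavaScript literal with the `/` escaped.

```javascript
// Everything on a single line, including a look-alike property name
// (password_expires) that should NOT be kept. Input is hypothetical.
const singleLine =
  '<property name="passive">1</property>' +
  '<property name="username">Hello</property>' +
  '<property name="password_expires">never</property>' +
  '<property name="password">QWERTY</property>';

// [^>]* cannot cross a ">", so a match cannot begin at an unwanted tag and
// spill into a later one; the closing quote after the name rejects
// password_expires; the lazy .*? stops at the nearest closing tag.
const wanted = /<property[^>]*name="(username|password)".*?<\/property>/g;

// match() yields the kept tags in document order; join() rebuilds the text.
const rebuilt = (singleLine.match(wanted) || []).join("\n");

console.log(rebuilt);
```

Extracting the matches and joining them gives the same result as deleting everything else, without needing a negated pattern.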

Matt Burland
  • It was nice of you to consider the _expires case. While this does work, I personally thought replacing everything that doesn't match would be easier in my scenario. But is your suggested alternative as follows: use array = string.match with the said regex, then use the array to generate the string again? – Sid Sep 26 '14 at 16:09