2

I have to remove a set of lines that start with a marker and end with another marker. I want to find all such pieces of text and remove them using regex. The problem is, regex only matches one line at a time. How should I proceed?

ohaal
Amogh Talpallikar
  • Could you please show the regexp you've already tried? Without that I can suggest that you're looking for the `m` (multiline) modifier in your regexp. – Minras Feb 22 '12 at 09:31

4 Answers

4

In most regex engines, you can append an s flag to the pattern as a "dotall" modifier. This makes . match any character, including newlines (which it normally does not match).

But JavaScript historically had no dotall modifier (the s flag was only added in ES2018), so in older engines you need a "pseudo-dotall" instead: a predefined character class combined with its negation. Together the two match any character, including a newline. The canonical example is [\s\S] (match anything that is whitespace or anything that is not whitespace = match anything). But any character class and its negation will do (e.g. [\d\D] will also work).

So in your case, if your start token is S and your end token is E you can do this:

string.replace(/S[\s\S]*?E/g, '')

Two notes: I am using the g (global) modifier to replace all instances. And in [\s\S]*?, the ? means "match the shortest sequence" (non-greedy). That way each start/end pair is matched as its own block, rather than everything between the first start token and the last end token being treated as a single block.
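A minimal sketch of the replacement above applied to multi-line input. The markers BEGIN and END and the helper name stripBlocks are hypothetical, chosen only for illustration:

```javascript
// Hypothetical markers; substitute your own and escape any regex metacharacters.
const BLOCK = /BEGIN[\s\S]*?END/g;

// Remove every BEGIN...END block, even when it spans several lines.
function stripBlocks(text) {
  return text.replace(BLOCK, '');
}

const input = 'keep 1\nBEGIN\ndrop me\nEND\nkeep 2\nBEGIN drop too END\nkeep 3';
console.log(stripBlocks(input));
```

The newline that followed each removed block stays in the result; appending \n? after the end marker in the pattern is one way to consume it as well.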

Ben Lee
  • By the way, `s` modifier doesn't exist in Javascript. People suggest using `[\s\S]` instead of it (http://stackoverflow.com/questions/1068280/javascript-regex-multiline-flag-doesnt-work). – Minras Feb 22 '12 at 09:36
  • @Minras, I already updated my answer to show that before you posted your comment ;) – Ben Lee Feb 22 '12 at 09:37
  • Sorry then, I hadn't noticed that. – Minras Feb 22 '12 at 09:38
  • Won't a dot do? Why do we need \s and \S? – Amogh Talpallikar Feb 22 '12 at 09:54
  • 1
    @AmoghTalpallikar, a dot won't match newlines. The [\s\S] means "match anything that is not whitespace or anything that is whitespace". Logically, this means "match anything". It would work equally well with any character class and its negation. For example, [\d\D] would work too. – Ben Lee Feb 22 '12 at 09:55
  • @AmoghTalpallikar, I updated my answer to add this explanation more clearly. – Ben Lee Feb 22 '12 at 09:58
  • @BenLee: Thanks a lot, it worked brilliantly for me, and I got to learn something new as well. Can you explain how the "?" mark works for a non-greedy match? – Amogh Talpallikar Feb 22 '12 at 10:20
  • 2
    @AmoghTalpallikar, normally, ".*" will match as many characters as it can in a row. ".*?" will match *as few* characters as it can while still making the match valid. So if you have a string "abc 0def1 0ghi1 jkl", then /0.*1/ will match "0def1 0ghi1" (longest match) whereas /0.*?1/ will match "0def1" (shortest match). – Ben Lee Feb 22 '12 at 10:35
  • @BenLee: That's right, but the "?" character means 0 or 1 match. Then how does it work here? – Amogh Talpallikar Feb 22 '12 at 10:47
  • @AmoghTalpallikar, usually `X?` means zero or one match of the previous character X. But when it follows a character that is itself something with a special meaning (like `*` or `+`) then it means "shortest match". So `X*?` means the shortest match of any number of X. Please read a regular expression doc if you are still confused. I've explained it best I can in the comments. – Ben Lee Feb 22 '12 at 16:57
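The greedy versus non-greedy behaviour discussed in these comments can be checked directly. A small sketch using the string from the comment above; the variable names are only illustrative:

```javascript
const s = 'abc 0def1 0ghi1 jkl';

// Greedy: .* grabs as much as possible, so the match runs to the last "1".
const greedy = s.match(/0.*1/)[0];

// Non-greedy: .*? stops at the first "1" that completes the match.
const lazy = s.match(/0.*?1/)[0];

// With the g flag, the non-greedy pattern finds each delimited piece separately.
const allLazy = s.match(/0.*?1/g);

console.log(greedy);  // "0def1 0ghi1"
console.log(lazy);    // "0def1"
console.log(allLazy); // ["0def1", "0ghi1"]
```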
1

For your specific problem, you could do something like this (example):

>[^<]+<
^  ^  ^
|  |__|__ End marker
|
Start marker

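This will match everything between the start marker > and the end marker <, including newlines, because the negated class [^<] matches any character except <. Pick whichever start and end markers you prefer. If your start or end marker is several characters long, wrap it in a non-capturing group: (?:yourmarkerhere). Note that a negated class can only exclude single characters, so the pattern works as written only when the end marker is one character.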

string.replace(/>[^<]+</g, '')
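A short sketch of the negated-class approach, assuming the single-character markers > and < from this answer. The second function is an assumption added for comparison: it uses hypothetical multi-character markers (<!-- and -->) with a non-greedy match, since a negated class cannot exclude a multi-character end marker. Both function names are made up for illustration.

```javascript
// Single-character markers: [^<] matches any character except "<", newlines included.
function stripAngle(text) {
  return text.replace(/>[^<]+</g, '');
}

// Multi-character markers (hypothetical): fall back to a non-greedy [\s\S]*?.
function stripComments(text) {
  return text.replace(/<!--[\s\S]*?-->/g, '');
}

console.log(stripAngle('a >one\ntwo< b >three< c'));
console.log(stripComments('x <!-- multi\nline --> y'));
```

Because the pattern uses +, an empty pair such as >< is left in place; use * instead if empty blocks should be removed too.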
ohaal
  • I forgot to add the last end marker, I also made it more obvious which is the start marker and which is the end marker. Try now. – ohaal Feb 22 '12 at 09:48
-1

Use an s modifier at the end of your regex pattern. Adding 's' lets . match line breaks, so the pattern can match text that spans several lines.

e.g. '/patternhere/s'

Check here for more info: http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php (this is the PHP/PCRE documentation; in JavaScript the s flag only works in engines that support ES2018 or later).
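To illustrate the s flag in JavaScript, here is a small sketch that assumes an ES2018 or later runtime (older engines reject the flag) and hypothetical markers S and E:

```javascript
const text = 'before S line 1\nline 2 E after';

// Without the s flag, . cannot cross the newline, so nothing is replaced.
const withoutFlag = text.replace(/S.*?E/g, '');

// With the s (dotAll) flag, . also matches line terminators.
const withFlag = text.replace(/S.*?E/gs, '');

console.log(withoutFlag === text); // true
console.log(withFlag);             // "before  after"
```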

Imran Omar Bukhsh
-2

I think you should try the /m modifier.

I just googled it; http://www.regular-expressions.info/modifiers.html says:

/m enables "multi-line mode". In this mode, the caret and dollar match before and after newlines in the subject string.
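A quick sketch of what that quote means in JavaScript: the m flag changes where ^ and $ can match, but it does not let . cross a newline. The sample text and variable names are made up for illustration.

```javascript
const text = 'first\nsecond\nthird';

// With m, ^ and $ match at every line boundary, so each line matches on its own.
const perLine = text.match(/^\w+$/gm);

// Without m, ^ and $ only match at the ends of the whole string: no match.
const wholeString = text.match(/^\w+$/g);

// m does not affect the dot: it still stops at a newline.
const dotWithM = /first.*third/m.test(text);

// A class plus its negation does span lines.
const pseudoDotAll = /first[\s\S]*third/.test(text);

console.log(perLine);      // ["first", "second", "third"]
console.log(wholeString);  // null
console.log(dotWithM);     // false
console.log(pseudoDotAll); // true
```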

BiAiB
  • 1
    -1 I think you googled something and didn't understand it. The `m` modifier only changes the behaviour of two anchors; it does not make a regex match across more than one line. – stema Feb 22 '12 at 09:37
  • @AmoghTalpallikar, this won't work. `multi-line mode` does not do what you want. But you can use a pseudo-dotall modifier like in my answer. – Ben Lee Feb 22 '12 at 09:39
  • It just worked. I tried an expression to remove multi-line comments on http://www.rubular.com/ with the pattern / \/\*.*\*\/ /m and it worked for me. – Amogh Talpallikar Feb 22 '12 at 09:44
  • @amogh good :) also don't forget the /g modifier to match multiple times: `/\/*.**\//gm` – BiAiB Feb 22 '12 at 09:50
  • 1
    @AmoghTalpallikar your pattern is wrong. `*` is a special character in regex and needs escaping if you want to match it literally. – stema Feb 22 '12 at 09:53
  • 1
    @AmoghTalpallikar, try again with a larger piece of data. I can 100% guarantee this will fail. Just read the quote in this answer. It explains exactly what the /m modifier does, and it's definitely not what you want. – Ben Lee Feb 22 '12 at 09:53
  • I know. It failed and matched everything. @BenLee's answer is the correct one. – Amogh Talpallikar Feb 22 '12 at 10:33