Update 2020:
Here is a clearer explanation of what the two flags are, plus a way to remember them, written with JavaScript in mind:
- Traditionally, JS regex had no `s` flag; it only had the `m` flag. As of January 2020, Chrome and NodeJS have `s`, while Firefox still doesn't. It is in the ES2018 spec.
- The `s` flag is also called `dotall` or `singleline`. All it does is make `.` match any character, including `\n`, `\r`, `\u2028` (line separator), and `\u2029` (paragraph separator). When people ask you what `.` matches and you answer "any character", that is not entirely correct: it matches every character except `\n`, `\r`, and the Unicode line separator and paragraph separator. For it to match truly every character, the `s` flag must be on.
- To work around the missing `s` flag in Firefox or on any other platform, you can use `[^]`, `[\s\S]`, `[\d\D]`, etc., or `(.|\s)` in place of `.`.
- That's all there is to the `s` flag, the one that was missing in traditional JavaScript.
- Now the `m` flag. It stands for multiline, and it really is very simple: without the `m` flag, `^` and `$` match only at the beginning and end of the whole string; with it, they also match at the beginning and end of each line. So `"John Doe\nMary Lee".match(/^John Doe$/)` will not match, and `"John Doe\nMary Lee".match(/^John Doe$/m)` will match. That's all. Don't overthink it: it only changes where `^` and `$` match.
- So are "singleline" and "multiline" mutually exclusive? No, they are not. For example, suppose I want to match `a`, then any characters including newlines, then `f`, where `a` must be at the beginning of a line and `f` must be at the end of a line, even in a text of 2000 lines. Then `"a b c \n d e f\nha".match(/^a.*f$/ms)` is what needs to be used: `.` matches `\n`, and `^` and `$` match at the beginning and end of a line.
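
The behaviour described in the list above can be checked with a short sketch. The variable names are my own, and it assumes an engine that already supports the `s` flag (for example a recent NodeJS or Chrome):

```javascript
const text = "a b c \n d e f\nha";

// Without s, the dot stops at the first line terminator.
const dotDefault = text.match(/a.*/)[0];   // "a b c "
// With s (dotAll), the dot also consumes "\n", so the match runs to the end.
const dotWithS = text.match(/a.*/s)[0];    // the whole string

// Without m, ^ and $ anchor only at the ends of the whole string.
const wholeOnly = "John Doe\nMary Lee".match(/^John Doe$/);    // null
// With m, they also anchor at the start and end of each line.
const perLine = "John Doe\nMary Lee".match(/^John Doe$/m)[0];  // "John Doe"

// The combined case: each flag alone fails, both together succeed.
const onlyS = text.match(/^a.*f$/s);       // null: $ needs the end of the string
const onlyM = text.match(/^a.*f$/m);       // null: . cannot cross "\n"
const both = text.match(/^a.*f$/ms)[0];    // "a b c \n d e f"

console.log({ dotDefault, dotWithS, wholeOnly, perLine, onlyS, onlyM, both });
```

The last three lines are the point: the two flags change different things, so neither can stand in for the other.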
That's it. The above was tested on NodeJS and Chrome, which already support the `s` flag (the `m` flag has long been supported). And remember, you can always work around a missing `s` flag by using `[^]`.
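
As a rough sketch of that workaround: the helper name `supportsDotAll` is made up for illustration, and the sample string is my own. It builds the flagged regex through the `RegExp` constructor, because on an engine without `s` a literal like `/./s` would fail when the script is parsed, before any check could run.

```javascript
// Hypothetical helper: true when the engine accepts the s flag.
function supportsDotAll() {
  try {
    new RegExp(".", "s");
    return true;
  } catch (e) {
    return false; // engines without s throw a SyntaxError for the unknown flag
  }
}

// A string containing all four line terminators that a plain dot skips.
const sample = "a\nb\r\nc\u2028d\u2029f";

// Flag-free alternatives that match any character at all.
const viaEmptyNegation = sample.match(/a[^]*f/)[0];
const viaComplementPair = sample.match(/a[\s\S]*f/)[0];

// Only use the flag when the engine has it.
const viaFlag = supportsDotAll()
  ? sample.match(new RegExp("a.*f", "s"))[0]
  : null;

console.log(viaEmptyNegation === sample, viaComplementPair === sample, viaFlag);
```

On an engine with the flag, all three results should be the full sample string.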
Now, why was `ms` or `ism` used so often in the past? Because a lot of the time, when we have a really long string (e.g. 2000 lines of HTML from some web content we got back), we rarely want `^` to match only the beginning of the entire string and `$` only the end of it. That's why we use the `m` flag. We also probably want to match across lines: although regex is not recommended for matching HTML, we may use `/<h1>.*?<\/h1>/` for a non-greedy match of a header, for example. We don't mind a `\n` in the content, because the author of the HTML may or may not have put one there. That's why we use the "dotall" flag `s`.
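
To see why the `s` flag matters for the header example, here is a small sketch on a made-up HTML snippet (the snippet and variable names are assumptions for illustration). It assumes an engine that supports `s`:

```javascript
const html = "<h1>\n  First\n</h1>\n<p>body</p>\n<h1>Second</h1>";

// Without s, the multi-line header is silently skipped and the
// single-line one further down is found instead.
const withoutS = html.match(/<h1>.*?<\/h1>/)[0];  // "<h1>Second</h1>"

// With s, the non-greedy match stops at the first closing tag.
const lazy = html.match(/<h1>.*?<\/h1>/s)[0];     // "<h1>\n  First\n</h1>"

// A greedy match with s runs on to the last closing tag instead.
const greedy = html.match(/<h1>.*<\/h1>/s)[0];    // the whole snippet

console.log({ withoutS, lazy, greedy });
```

The first result is the easy one to miss: nothing fails, the regex just finds a different header than the one you expected.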
But if you are trying to extract some info from a web page, you probably won't care whether something is at the beginning or end of a line, because HTML files can contain spaces (for example as indentation) that usually don't affect the page content (unless there is a `<pre>` etc.). So you won't need `^` or `$`, and therefore you can forget about the `m` flag. And if you don't mind using `[^]*?` instead of `.*?`, then you can forget about the `s` flag too. End of story.
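
A flag-free extraction along those lines might look like the sketch below. The page content and variable names are invented for the example. Note that `[^]` is a JavaScript feature; in most other regex flavors you would write `[\s\S]` instead.

```javascript
const page =
  "<html>\n  <head>\n    <title>\n      Example page\n    </title>\n  </head>\n</html>";

// No flags at all: [^]*? lazily matches any characters, newlines included.
const found = page.match(/<title>([^]*?)<\/title>/);

// The capture group still holds the indentation, so trim it.
const title = found ? found[1].trim() : null;

console.log(title); // "Example page"
```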
Perl Cookbook said it in two sentences:
The difference between `/m` and `/s` is important: `/m` makes `^` and `$` match next to a newline, while `/s` makes `.` match newlines. You can even use them together - they're not mutually exclusive options.
Maybe this way, I will never forget:
When I want to match across lines (usually using `.*?` to match something where it doesn't matter if it spans multiple lines), I will naturally think of multiline, and therefore `m`. Well, `m` is actually not the one, so it is `s`.
(Since I already remember `ism` so well, I can always remember that if it is not `m`, then it must be `s`.)
Other lame attempts include:
- `s` is for DOTALL: it is for DOT to match ALL.
- `m` is multiline: it is for `^` and `$` to match a lot of times.
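
If it helps the memory, JavaScript's own property names spell out the same thing. This is a small sketch with my own variable names, assuming an engine that supports the `s` flag:

```javascript
const re = /^a.*f$/sm;

const flagSummary = {
  dotAll: re.dotAll,        // true: the s flag is exposed as "dotAll"
  multiline: re.multiline,  // true: the m flag is exposed as "multiline"
  flags: re.flags,          // "ms": flags are reported in a fixed order
};

// A regex with no flags has both properties set to false.
const plain = /a.*f/;
const plainSummary = { dotAll: plain.dotAll, multiline: plain.multiline };

console.log(flagSummary, plainSummary);
```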