48

I have a string that contains normal characters, whitespace characters, and newline characters between <div> and </div>.

This regular expression doesn't work: /<div>(.*)<\/div>/. That is because .* doesn't match newline characters. How can I do this?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131

6 Answers

58

You need to use the DOTALL modifier (/s).

'/<div>(.*)<\/div>/s'

This might not give you exactly what you want because you are greedy matching. You might instead try a non-greedy match:

'/<div>(.*?)<\/div>/s'
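The difference between the greedy and non-greedy forms is easiest to see on input with more than one div. Below is a minimal sketch in Python, used here only for illustration and on the assumption that its `re.DOTALL` flag behaves like the `/s` modifier; the sample string is made up.

```python
import re

# Hypothetical sample input: two separate divs, each containing a newline.
html = "<div>first\nblock</div>\n<p>gap</p>\n<div>second\nblock</div>"

# Without DOTALL, '.' stops at the first newline, so nothing matches.
no_flag = re.findall(r"<div>(.*)</div>", html)

# Greedy with DOTALL: a single match from the first <div> to the last </div>.
greedy = re.findall(r"<div>(.*)</div>", html, re.DOTALL)

# Non-greedy with DOTALL: each div is captured on its own.
lazy = re.findall(r"<div>(.*?)</div>", html, re.DOTALL)

print(no_flag)
print(greedy)
print(lazy)
```

The greedy version swallows the `<p>` element sitting between the two divs, which is usually not what is wanted.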

You could also solve this by matching everything except '<' if there aren't other tags:

'/<div>([^<]*)<\/div>/'
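A negated character class matches newlines without any modifier, but it stops at the first `<`. A small sketch of both effects, again in Python as a stand-in and with made-up sample strings:

```python
import re

# [^<] matches any character except '<', and that includes newlines.
pattern = re.compile(r"<div>([^<]*)</div>")

plain = "<div>line one\nline two</div>"
nested = "<div>some <b>bold</b> text</div>"

plain_match = pattern.search(plain)    # matches across the newline
nested_match = pattern.search(nested)  # fails: the inner <b> tag blocks it

print(plain_match.group(1) if plain_match else None)
print(nested_match)
```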

Another observation is that you don't need to use / as your regular expression delimiter. Using another character means that you don't have to escape the / in </div>, which improves readability. This applies to all the above regular expressions. Here's how it would look if you used '#' instead of '/':

'#<div>([^<]*)</div>#'

However, all these solutions can fail due to nested divs, extra whitespace, HTML comments and various other things. HTML is too complicated to parse with regular expressions, so you should consider using an HTML parser instead.
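To give a rough idea of what the parser route looks like, here is a sketch built on Python's standard `html.parser` module. It is an illustration only: the class name `DivTextCollector` and the helper `div_texts` are hypothetical, and in PHP the equivalent would be an HTML/DOM library such as `DOMDocument`.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collects the text inside each top-level <div>, nested divs included."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div> elements are currently open
        self.current = []   # text pieces of the div being read
        self.blocks = []    # finished top-level divs

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        # Comments never reach this method, so they are skipped automatically.
        if self.depth > 0:
            self.current.append(data)


def div_texts(html):
    parser = DivTextCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


sample = "<div>outer\n<div>inner</div>\n<!-- a comment -->tail</div>"
print(div_texts(sample))
```

The nested div and the HTML comment, both of which trip up the regular expressions above, are handled here without special cases.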

Mark Byers
  • 811,555
  • 193
  • 1,581
  • 1,452
  • Can I ask you why there is [^<] in '/<div>[^<]*<\/div>/m'? I know what it means but I don't understand why you are using it. I think that it can cause problems, for example with <div><b>some bold text</b></div> – Gaim Dec 31 '09 at 16:10
  • +1 for the regex vs. HTML insight. If you need to work with HTML, you probably want some sort of DOM. – Williham Totland Dec 31 '09 at 16:16
  • Ok, I know that it is subjective but I think that problems only with `<div>a<div>b</div>c</div>` are a lesser evil than problems with all nested tags. Btw I think that you are missing `/m` in the last expression because it will still be only single-line. – Gaim Dec 31 '09 at 16:21
  • 1
    `/s` is the modifier that lets the dot match newlines (single-line or DOTALL mode); `/m` changes the behavior of `^` and `$` (multiline mode). Unless you're working with Ruby, where multiline mode is always on and `/m` turns on DOTALL mode. Or JavaScript, which has no DOTALL mode. – Alan Moore Dec 31 '09 at 16:39
  • @Mark, if you're using the negated character class `[^<]*`, which may be a good idea depending on the nature of the subject string, I'd make it possessive to prevent possible needless backtracking: `#<div>([^<]*+)</div>#`. – Geert Jan 01 '10 at 06:35
  • DOTALL modifier!? like WTF, who the h..ll ever came with this crap, it's ridiculous, stupid php – Eugene Kuzmenko Jan 17 '13 at 19:06
19

To match all characters, you can use this trick:

%\<div\>([\s\S]*)\</div\>%
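The trick works because `\s` and `\S` together cover every character, so the class needs no mode flag. A short sketch, assuming Python's `re` as a stand-in and a made-up sample string:

```python
import re

text = "<div>line one\nline two</div>"

# No flags: the plain dot cannot cross the newline...
dot_match = re.search(r"<div>(.*)</div>", text)

# ...but [\s\S] (whitespace or non-whitespace) matches every character.
class_match = re.search(r"<div>([\s\S]*)</div>", text)

print(dot_match)
print(class_match.group(1) if class_match else None)
```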
Hossein
  • 219
  • 2
  • 8
  • 6
    That's a hack to work around JavaScript's lack of a DOTALL/single-line mode; it isn't needed in PHP. Also, `<` and `>` have no special meanings, so you don't have to escape them. – Alan Moore Jul 29 '12 at 11:26
  • @Alan Moore It may be a hack, but it's a good trick to note in general, because you don't need to worry about mode support or which symbolic tokens to escape. Plus you may not want to change the mode for the whole regex. – Beejor Sep 05 '17 at 22:05
  • 1
    @AlanMoore In case anyone stumbles on this, JavaScript does have an "s" flag (maybe it didn't in 2012) for RegExp to provide DOTALL behavior. https://javascript.info/regexp-introduction#flags – spex Dec 25 '20 at 07:11
  • But this question is tagged with "PHP". Does it work in other than JavaScript? Does it actually work with newline characters in PHP? User *Hossein* has left the building, so others need to chime in. – Peter Mortensen Nov 18 '21 at 18:07
17

You can also use the (?s) mode modifier. For example,

/(?s)<div>(.*?)<\/div>/
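The inline modifier belongs to the pattern itself, so in PHP it goes inside the delimiters rather than in front of them. As a hedged illustration in Python (which has no delimiters, and where a global `(?s)` must come at the very start of the pattern), the sketch below also shows the scoped form `(?s:...)`, which limits the mode to one part of the expression; the sample string is made up.

```python
import re

text = "<div>line one\nline two</div>"

# Global inline flag: applies DOTALL to the whole pattern.
global_flag = re.search(r"(?s)<div>(.*?)</div>", text)

# Scoped inline flag: DOTALL applies only inside the (?s: ... ) group.
scoped_flag = re.search(r"<div>((?s:.*?))</div>", text)

print(global_flag.group(1) if global_flag else None)
print(scoped_flag.group(1) if scoped_flag else None)
```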
acarlon
  • 16,764
  • 7
  • 75
  • 94
5

There shouldn't be any problem with just doing:

(.|\n)

This matches either any character except newline or a newline, so every character. It solved it for me, at least.
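To use that alternation between the div tags, it has to be repeated as a group. A minimal sketch in Python (for illustration only; the sample string is made up) with a non-capturing group inside the capture:

```python
import re

text = "<div>line one\nline two</div>"

# (?:.|\n) matches one character: either "anything but newline" or a newline.
# Repeating the group lets the capture span several lines without any flag.
alt_match = re.search(r"<div>((?:.|\n)*)</div>", text)

print(alt_match.group(1) if alt_match else None)
```

Stepping through an alternation one character at a time is likely to be slower than a single dot in DOTALL mode, so this form is probably best kept for short inputs.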

DavidsKanal
  • 655
  • 11
  • 15
1

An option would be:

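'/<div>((?:\n|.)*)<\/div>/i'

This matches any run of characters in which each one is either a newline or whatever the dot matches. The alternation has to be repeated as a whole: a group such as (\n*|.*) matches either only newlines or only non-newline characters, never a mix of the two.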

MillerMedia
  • 3,651
  • 17
  • 71
  • 150
-2

There is usually a flag in the regular expression compiler to tell it that dot should match newline characters.

pau.estalella
  • 2,197
  • 1
  • 15
  • 20
  • The suspense! Can you reveal what the flag is? Please respond by [editing (changing) your answer](https://stackoverflow.com/posts/1985952/edit), not here in comments (***without*** "Edit:", "Update:", or similar - the answer should appear as if it was written today). – Peter Mortensen Nov 18 '21 at 18:04