35

How can I match subject via a PHP preg_match() regular expression pattern in this HTML code:

      <table border=0>
  <tr>
  <td>


  <h2>subject</h2>



    </td>

All the whitespace and newlines are left in on purpose. So the problem is extracting the subject name using some multiline pattern.

Peter Mortensen
Dmitriy Ryabinin
  • This article may be useful: [multiline-searches-with-preg_match-in](https://blog-en.openalfa.com/multiline-searches-with-preg_match-in-php) – Ali Yousefi May 04 '20 at 04:33

6 Answers

67

If you're looking for (e.g.) an h2 tag nested within a td tag where there's only whitespace between the two, just use \s, which matches spaces, tabs, newlines, etc. For example:

preg_match('#<td>\s*<h2>(.*?)</h2>\s*</td>#i',$str,$matches);
// result is in $matches[1]

See it in action here.
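
As a self-contained sketch, the pattern above can be run against the HTML fragment from the question. The variable name `$str` and the nowdoc wrapper are only for illustration:

```php
<?php
// HTML fragment from the question, whitespace and newlines preserved.
$str = <<<'HTML'
      <table border=0>
  <tr>
  <td>


  <h2>subject</h2>



    </td>
HTML;

// \s* absorbs any run of spaces, tabs, and newlines between the tags.
if (preg_match('#<td>\s*<h2>(.*?)</h2>\s*</td>#i', $str, $matches)) {
    echo $matches[1], "\n";
}
```

No s or m modifier is needed here, because the only newlines sit between the tags, where \s* already covers them.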

For your interest, here is a list of different modifiers you can pass in to preg_* functions. Flags that may interest you are:

  • s ("dotall") : this one makes . match every character, including newlines. So, say your <h2>.....</h2> was spread over multiple lines. Then you'd have to do

    preg_match('#<td>\s*<h2>(.*?)</h2>\s*</td>#is',$str,$matches);
    

    in order to have the .* go over multiple lines (see the extra s at the end of the regex?).

  • m ("multiline") : this one just lets ^ and $ match start/end of line instead of just the start/end of string. You only really need it if you're using ^ and $ in your pattern and want them to match the start/end of each individual line in your input.
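
To make the difference between the two flags concrete, here is a small sketch. The input strings are made up for the demonstration and are not from the question:

```php
<?php
// Hypothetical input: the heading text itself contains a newline.
$str = "<td>\n<h2>first line\nsecond line</h2>\n</td>";
$pattern = '#<td>\s*<h2>(.*?)</h2>\s*</td>#';

// Without s, . stops at the newline inside the heading, so nothing matches.
$withoutS = preg_match($pattern . 'i', $str, $m1);

// With s, . also matches newlines and the capture spans both lines.
$withS = preg_match($pattern . 'is', $str, $m2);

// m only changes the anchors: ^ and $ now match at every line boundary.
$lines = "alpha\nbeta\ngamma";
$anchoredCount = preg_match_all('/^\w+$/m', $lines, $m3);

var_dump($withoutS, $withS, $m2[1], $anchoredCount);
```

The first call finds nothing, the second captures both lines of the heading, and the anchored pattern matches once per line.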
mathematical.coffee
15

You can add the s modifier to your regular expression so that . also matches newlines (the m modifier only affects ^ and $):

// Given your HTML content.
$html = 'Your HTML content';
preg_match('/<td[^>]*>(.*?)<\/td>/is', $html, $matches);

Hope this (still) helps, hahaha.

Saul Martínez
4

You shouldn't use regex to parse HTML content. It can cause a lot of issues if you cannot control what the user can input. There are better solutions in almost every language; in most cases an HTML or XML parser does a better job. Check out DOMDocument, simplehtmldom, or php-html-parser.

See here for more answers on why you shouldn't use regex on HTML content: RegEx match open tags except XHTML self-contained tags
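
A minimal sketch of the DOMDocument approach, assuming the dom extension is available (it is enabled by default in most PHP builds). The variable names and the XPath expression `//td/h2` are illustrative choices, not something given in the question:

```php
<?php
// HTML fragment from the question (unclosed tags left as they are).
$html = "<table border=0>\n<tr>\n<td>\n\n  <h2>subject</h2>\n\n</td>";

$doc = new DOMDocument();
// Collect parser warnings about the incomplete markup instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Select h2 elements that are direct children of a td.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td/h2');

$subject = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
var_dump($subject);
```

Because the parser builds a tree, the whitespace and newlines between the tags never have to be described in a pattern at all.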

Peter Mortensen
Maciej Paprocki
  • 1
    I was looking for this answer. I was surprised that 5 years later nobody suggested that maybe it's a bad idea to parse html with regex. Don't understand why it's downvoted. – s3v3n Dec 06 '16 at 14:34
  • Yep, welcome to the club. I still stand by my answer, though :) – Maciej Paprocki Dec 06 '16 at 16:03
  • 1
    This is definitely the way to approach this. Gave it another upvote at least :-) – Marty Jan 07 '17 at 01:56
  • 4
    I haven’t voted on this, but I might add that it misses the point of the question, which is how to use `preg_match` with multiple lines. It is _not_ answering the question if you don’t like the use case. – Manngo Sep 13 '19 at 02:10
  • Hmm. I think I am offering better solutions than the one provided. If someone uses the wrong tool, shouldn't I tell them so and offer better alternatives? – Maciej Paprocki Sep 16 '19 at 14:58
3

Very simply with

preg_match('/<h2>(.*?)<\\/h2>/', $str, $matches);
print($matches[1]);

The multi-line format has no effect on the regex unless you need to match a string that spans multiple lines.

Borodin
  • Sorry, I should have been more specific. The problem is the lack of "identifiers" in the HTML code I am dealing with. There can be other h2 tags and other elements. So I am trying to use the surrounding tags to target exactly this particular place in the code. So how can I make regex patterns understand multiple lines?... – Dmitriy Ryabinin Jan 22 '12 at 02:31
0

Catch a block of code delimited by four backticks (as in Markdown syntax).

Example to be adapted easily.

<?php

$str = '
# Some Text

```` 
    h5 {
      font-size: 1rem;
      font-weight: 600;
    }
````

And some text.
';

// s lets . match newlines; the lazy .*? stops at the nearest closing fence.
$reg = '/````(.*?)````/s';

preg_match($reg, $str, $matches);
echo $matches[0];

/* OUTPUT
```` 
    h5 {
      font-size: 1rem;
      font-weight: 600;
    }
````
*/

echo preg_replace($reg, "DELETED", $str);

/* OUTPUT
# Some Text

DELETED

And some text.
*/
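
If the text contains several such blocks, or only the inner content is wanted, a variation along these lines may help. This is a sketch: the sample document and the names `$fence`, `$doc`, and `$blocks` are made up for illustration:

```php
<?php
// Hypothetical document containing two blocks fenced by four backticks.
$fence = str_repeat('`', 4);
$doc = "Intro\n{$fence}\nfirst block\n{$fence}\nMiddle\n{$fence}\nsecond block\n{$fence}\nEnd";

// s lets . cross newlines; the lazy .*? stops at the nearest closing fence.
$reg = '/`{4}(.*?)`{4}/s';

preg_match_all($reg, $doc, $matches);

// $matches[1] holds the text between the fences, one entry per block.
$blocks = array_map('trim', $matches[1]);
print_r($blocks);
```

The lazy quantifier matters here: a greedy `.*` would run from the first opening fence to the last closing one and merge both blocks into a single match.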
NVRM
-5

You have to account for the line breaks by including \s in the regular expression's character class:

$str ="<ol>
         <li>Capable for unlimited product</li>
         <li>Two currency support</li>
         <li>Works with touch screens and click screen based systems</li>
         <li>Responsive design <b>shopping cart</b>, Specially design for Mac, iPhone, iPad, PC and Android</li>
         <li>VAT for countries that support a Value Added Tax</li>
         <li>Barcode scanner checkout option for POS</li>
         <li>mRSS</li>
       </ol>";

preg_match("/^([A-Za-z0-9\s\<\>\.\,\/\-\ ]+)$/", $str);

// Sanitize your input before saving it to the database.

function test_input($data) {
    $data = trim($data);
    $data = htmlspecialchars($data);
    $data = json_encode($data);
    $data = addslashes($data);
    return $data;
}

echo test_input($str);
Peter Mortensen