Regex help to parse dates

Question

I am trying to get the important dates from this string...

<tr> <td>Account Registered :</td> <td>2008-02-02</td></tr>
<tr> <td>Account Updated :</td> <td>2014-02-01</td></tr>
<tr> <td>Account Expires :</td> <td>2015-02-02</td></tr>

And I have tried the following...

preg_match('#<tr> <td>Account Expires :</td> <td>[0-9]{4}-[0-9]{2}-[0-9]{2}#', $result, $matches);

And it gives the following...

array (size=1)
  0 => string '<tr> <td>Account Expires :</td> <td>2015-02-02' (length=38)

I want to get all three dates, using either one regex or three different ones. Please help me with this. Thanks.

Parse the HTML first, then pull the data from the HTML, then parse the data as `yyyy-mm-dd` format. — zzzzBov, Sep 22 '14 at 19:59
Put the date in a "capture group" (also use `preg_match_all` to get *all* matches). — gen_Eric, Sep 22 '14 at 19:59
You really cannot use regex to parse the DOM, but you can use PHP's DOM parser. — Jay Blanchard, Sep 22 '14 at 19:59

score 3 · Answer 1 · edited May 23 '17 at 12:21

You can use () to define capture groups, and their contents will be available in the results of preg_match_all() (which finds every match, unlike preg_match(), which stops at the first one). Then you just need to stop hard-coding the word Expires and capture the label instead:

$result = '
<tr> <td>Account Registered :</td> <td>2008-02-02</td></tr>
<tr> <td>Account Updated :</td> <td>2014-02-01</td></tr>
<tr> <td>Account Expires :</td> <td>2015-02-02</td></tr>
';

if(preg_match_all('#<tr>\s*<td>Account\s*([^:]*?)\s*:</td>\s*<td>([0-9]{4}-[0-9]{2}-[0-9]{2})#', $result, $matches, PREG_SET_ORDER)) {
    print_r($matches);

    // Array
    // (
    //     [0] => Array
    //         (
    //             [0] => <tr> <td>Account Registered :</td> <td>2008-02-02
    //             [1] => Registered
    //             [2] => 2008-02-02
    //         )
    // 
    //     [1] => Array
    //         (
    //             [0] => <tr> <td>Account Updated :</td> <td>2014-02-01
    //             [1] => Updated
    //             [2] => 2014-02-01
    //         )
    // 
    //     [2] => Array
    //         (
    //             [0] => <tr> <td>Account Expires :</td> <td>2015-02-02
    //             [1] => Expires
    //             [2] => 2015-02-02
    //         )
    // )
}

That said, you shouldn't rely on regex to parse HTML in general, since HTML isn't a regular language. A good exception to this "rule" is when the HTML comes from your own code and you know its format is simple and fixed enough for a regular expression to match.
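
As a possible follow-up to the capture groups above, here is a minimal sketch that drops the hard-coded Account prefix and folds the matches into a label => date array. The variable names `$pattern` and `$dates` are only illustrative, and the sketch assumes every label cell ends with a colon, as in the question's sample.

```php
<?php
// Sample input copied from the question.
$result = '
<tr> <td>Account Registered :</td> <td>2008-02-02</td></tr>
<tr> <td>Account Updated :</td> <td>2014-02-01</td></tr>
<tr> <td>Account Expires :</td> <td>2015-02-02</td></tr>
';

// Group 1 captures the whole label up to the colon; group 2 captures the date.
$pattern = '#<tr>\s*<td>\s*([^:<]+?)\s*:</td>\s*<td>([0-9]{4}-[0-9]{2}-[0-9]{2})#';

$dates = []; // illustrative name: label => DateTime
if (preg_match_all($pattern, $result, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[1] is the label, $m[2] is the yyyy-mm-dd string.
        $dates[$m[1]] = new DateTime($m[2]);
    }
}

foreach ($dates as $label => $date) {
    echo $label . ' => ' . $date->format('Y-m-d') . "\n";
}
```

Keying the array by label means a later lookup such as `$dates['Account Expires']` does not depend on the order of the rows.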

answered Sep 22 '14 at 20:02

Sam

wow that's a great answer, should i use php dom to parse dates? – seoppc Sep 22 '14 at 20:02
I'd suggest using PHP's [`DOMDocument`](http://php.net/manual/en/class.domdocument.php) to parse the HTML, yes. But if you want to use regex, this is the way you'd want to do it. If you want to manipulate the dates any further, I suggest using [`new DateTime('2008-02-02')`](http://php.net/manual/en/class.datetime.php). – Sam Sep 22 '14 at 20:04
Looks like @hwnd beat me, enjoy! – Sam Sep 22 '14 at 20:10
Hello, there might be some requirements where I will not have Account as a prefix; there might be First Registered, Last Updated and Expires. So how can I get those using a DOM parser? – seoppc Sep 22 '14 at 20:21
If your regex is easy enough, like this one, why not use regex to 'parse' dom? This is much faster and readable and probably future proof. – Rudie Sep 22 '14 at 20:22
@seoppc just remove the `Account` from my expression to do it in regex. So your key will be captured with `([^:]+):`. – Sam Sep 22 '14 at 20:29
@Rudie, it's not necessarily readability or future-proofing that leads people to recommend against regex for HTML. It is the fact that HTML is not a regular language, meaning a regular expression isn't built to understand the language. Using this in a real-world HTML situation (parsing live websites) is not future proof and will need to be maintained for tweaks and unexpected formatting. However, if you are using an internal chunk of HTML and you know how it is formatted, this could be considered regular, and many times regex is the perfect tool (which is why I answered the question). – Sam Sep 22 '14 at 20:30
@Sam I know why people (including me 90% of the time) recommend against using regex for html. It's just too bad that people now ALWAYS recommend against it, without looking at the use case. Looking for a "date within tags" is definitely more future proof than looking for specific tags with specific contents. – Rudie Sep 22 '14 at 20:38
@Rudie fair point, let me update how I worded my note since I'm not "ALWAYS" recommending against it (but does sound like it). I don't even answer questions that seem too complex for HTML and will comment on them instead. – Sam Sep 22 '14 at 20:40

@Sam I didn't mean you specifically in this case =) People that don't really know what they're talking about now scream "DON'T USE REGEX FOR HTML" because they heard it some time. DON'T SCREAM! THINK! Sorry, that's all =) You have all my upvotes for answering with regex! – Rudie Sep 22 '14 at 20:43

hwnd · Answer 2 · 2014-09-22T20:49:24.987

2

You can use a regular expression for something simple like this.

preg_match_all('/\b\d{4}-\d{2}-\d{2}\b/', $html, $matches);
print_r($matches[0]);

But I recommend using a parser such as DOM instead to extract these values.

// Load your HTML. loadHTML() is an instance method, so create the
// DOMDocument first; calling it statically is an error in PHP 8.
$dom = new DOMDocument();
$dom->loadHTML('
     <tr> <td>foo bar</td> <td>123456789</td></tr>
     <tr> <td>Account Registered :</td> <td>2008-02-02</td></tr>
     <tr> <td>Account Updated :</td> <td>2014-02-01</td></tr>
     <tr> <td>Account Expires :</td> <td>2015-02-02</td></tr>
     <tr> <td>something else</td> <td>foo</td></tr>
');

$xp  = new DOMXPath($dom);
$tag = $xp->query('//tr/td[contains(.,"Account")]/following-sibling::*[1]');

foreach($tag as $t) { 
   echo $t->nodeValue . "\n";
}

// 2008-02-02
// 2014-02-01
// 2015-02-02

If you are unsure what the prefix will be (i.e. Account could change), a simple fix is to drop the prefix test and validate each cell's value as a date instead.

$xp  = new DOMXPath($dom);
$tag = $xp->query('//tr/td/following-sibling::*[1]');

foreach($tag as $t) { 
   $date = date_parse($t->nodeValue);
   if ($date["error_count"] == 0 && 
       checkdate($date["month"], $date["day"], $date["year"])) {
         echo $t->nodeValue . "\n";
   }
}

// 2008-02-02
// 2014-02-01
// 2015-02-02
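
Building on the validation idea above, the following is a rough sketch of one way to keep each label together with its date when the prefix is not fixed. It assumes two-cell rows like those in the question; the sample labels come from the asker's comment, and the `<table>` wrapper and the names `$html` and `$dates` are made up for illustration. `libxml_use_internal_errors(true)` is included because it is the usual way to keep `loadHTML()` markup warnings from being printed.

```php
<?php
// Hypothetical input: labels taken from the asker's comment, table wrapper assumed.
$html = '<table>
<tr> <td>foo bar</td> <td>123456789</td></tr>
<tr> <td>First Registered :</td> <td>2008-02-02</td></tr>
<tr> <td>Last Updated :</td> <td>2014-02-01</td></tr>
<tr> <td>Expires :</td> <td>2015-02-02</td></tr>
</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect markup warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xp    = new DOMXPath($dom);
$dates = []; // illustrative name: label => yyyy-mm-dd string

// Visit every row that has exactly two cells: label and value.
foreach ($xp->query('//tr[count(td) = 2]') as $row) {
    $cells = $xp->query('td', $row);
    $label = trim(rtrim(trim($cells->item(0)->nodeValue), ':'));
    $value = trim($cells->item(1)->nodeValue);

    // Strict check: the value must be exactly yyyy-mm-dd and a real calendar date.
    $date = DateTime::createFromFormat('!Y-m-d', $value);
    if ($date !== false && $date->format('Y-m-d') === $value) {
        $dates[$label] = $value;
    }
}

print_r($dates);
```

Comparing the re-formatted date with the original string rejects values such as `2015-02-30`, which `createFromFormat()` would otherwise roll over to a date in March.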

answered Sep 22 '14 at 20:07

hwnd

thanks i just needed an example using DomDocument, i will start using for future needs. thanks for your help. – seoppc Sep 22 '14 at 20:09
+1 for going outside of OP's request and doing a `DOMDocument` example :) – Sam Sep 22 '14 at 20:09
Hello, there might be some requirements where I will not have Account as a prefix; there might be First Registered, Last Updated and Expires. So how can I get those using this code? – seoppc Sep 22 '14 at 20:16
Don't know why, but I am getting this error... `Warning: DOMDocument::loadHTML(): Tag header invalid in Entity` – seoppc Sep 22 '14 at 20:27
Or if all headings have colons, this is the XPath I'd suggest: `//tr/td[contains(., ":")]/following-sibling::td[1]` – Sam Sep 22 '14 at 20:31

@Sam, I got a better one, check the update in a few seconds.. =) – hwnd Sep 22 '14 at 20:32
What if they change it to a `DL`? Like `TR > TD`, but now `DL > DD`. – Rudie Sep 22 '14 at 20:40

Rudie · Answer 3 · 2014-09-22T21:48:13.890

2

Simple regex for 'parsing' HTML is fine. It's probably faster and more future proof than using a DOM parser.

This one catches all 'dates within tags':

preg_match_all('#>(\d\d\d\d-\d\d-\d\d)<#', $html, $matches);
$dates = $matches[1];
print_r($dates);

Makes:

Array
(
    [0] => 2008-02-02
    [1] => 2014-02-01
    [2] => 2015-02-02
)

If there are more dates in $html and you only want those 3, forget this answer.

If you want to include times in the date(time)stamp, use this pattern:

#>(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d)<#
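
If the markup mixes plain dates and timestamps, the two patterns above could be merged by making the time part optional. This is only a sketch: the sample `$html` string and the name `$stamps` are invented for illustration.

```php
<?php
// Hypothetical sample: one plain date, one timestamp, one non-date cell.
$html = '<td>2008-02-02</td> <td>2015-09-21 01:02:26</td> <td>12345</td>';

// The non-capturing group (?: ...)? makes the " hh:mm:ss" part optional.
preg_match_all('#>(\d{4}-\d{2}-\d{2}(?: \d{2}:\d{2}:\d{2})?)<#', $html, $matches);
$stamps = $matches[1];
print_r($stamps);

// Array
// (
//     [0] => 2008-02-02
//     [1] => 2015-09-21 01:02:26
// )
```

As with the patterns above, this only checks the shape of the text between tags, so a value like `9999-99-99` would still match.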

answered Sep 22 '14 at 20:26

Rudie

Good eye on just matching the date, +1. Should work depending on OP's needs. – Sam Sep 22 '14 at 20:41
Hello, One more thing if you can help, how can it grab all dates with time, like `2015-09-21 01:02:26` – seoppc Sep 22 '14 at 21:11
Updated answer. Very easy. It's **really** worth learning basic regex, I promise. Things like `\d` and modifiers and capture groups and delimiters are very easy and very very valuable. – Rudie Sep 22 '14 at 21:49
