Need suggestion of a good way to find content of a div

Question

<div class="box notranslate" id="venueHours">
<h5 class="translate">Hours</h5>
<div class="status closed">Currently closed</div>
<div class="hours">
  <div class="timespan">
    <div class="openTime">
      <div class="days">Mon,Tue,Wed,Thu,Sat</div>
      <span class="hours"> 10:00 AM–6:00 PM</span>
    </div>
  </div>
  <div class="timespan">
    <div class="openTime">
      <div class="days">Fri</div>
      <span class="hours"> 10:00 AM–9:00 PM</span>
    </div>
  </div>
  <div class="timespan">
    <div class="openTime">
      <div class="days">Sun</div>
      <span class="hours"> 10:00 AM–5:00 PM</span>
    </div>
  </div>
</div>
</div>

I'm trying to capture the contents of all the <div class="days"> and <span class="hours"> elements. I think I could use regular expressions for this task, but I'd also like to learn other fun or professional ways to capture specific div blocks like this. Thanks.

Do you want the content *enclosed by* those tags or the content *within* those tags? — Kenosis, May 20 '12 at 15:20
Never parse XML/HTML/CSV files using regex. Use the existing modules, they are usually mature, stable and well tested. — dgw, May 20 '12 at 15:38

Joel Berger · Accepted Answer · 2012-05-21T12:05:31.700

In addition to the HTML parsing libraries mentioned elsewhere, other modules have DOM capability too. See for example Web::Query and Mojolicious' Mojo::DOM.

Here is an example using Mojo::DOM and CSS3 selectors:

#!/usr/bin/env perl

use strict;
use warnings;

use 5.10.0;
use Mojo::DOM;

my $dom = Mojo::DOM->new(<<'HTML');
<div class="box notranslate" id="venueHours">
<h5 class="translate">Hours</h5>
<div class="status closed">Currently closed</div>
<div class="hours">
  <div class="timespan">
    <div class="openTime">
      <div class="days">Mon,Tue,Wed,Thu,Sat</div>
      <span class="hours"> 10:00 AM–6:00 PM</span>
    </div>
  </div>
  <div class="timespan">
    <div class="openTime">
      <div class="days">Fri</div>
      <span class="hours"> 10:00 AM–9:00 PM</span>
    </div>
  </div>
  <div class="timespan">
    <div class="openTime">
      <div class="days">Sun</div>
      <span class="hours"> 10:00 AM–5:00 PM</span>
    </div>
  </div>
</div>
</div>
HTML

say "div days:";
say $_->text for $dom->find('div.days')->each;

say "\nspan hours:";
say $_->text for $dom->find('span.hours')->each;

Or equivalently:

say "div days:";
say for $dom->find('div.days')->map(sub{$_->text})->each;

say "\nspan hours:";
say for $dom->find('span.hours')->map(sub{$_->text})->each;

Output:

div days:
Mon,Tue,Wed,Thu,Sat
Fri
Sun

span hours:
 10:00 AM–6:00 PM
 10:00 AM–9:00 PM
 10:00 AM–5:00 PM

Or, to get the times alongside their corresponding days, you can use the children of each openTime div:

say "Open Times:";
say for $dom->find('div.openTime')
            ->map(sub{$_->children->each})
            ->map(sub{$_->text})
            ->each;

Output:

Open Times:
Mon,Tue,Wed,Thu,Sat
 10:00 AM–6:00 PM
Fri
 10:00 AM–9:00 PM
Sun
 10:00 AM–5:00 PM
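
If the days and hours are needed as pairs rather than as one flat list, one option is to query each openTime block on its own. The following is only a minimal sketch and not part of the original answer: the helper name `open_times` and the shortened sample HTML are illustrative assumptions, and whitespace is trimmed by hand because how `text` treats surrounding whitespace may depend on the Mojolicious version.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;              # the sample HTML below contains an en dash
use feature 'say';
use Mojo::DOM;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: returns a list of [days, hours] pairs,
# one pair per <div class="openTime"> block.
sub open_times {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);
    my @pairs;
    for my $slot ($dom->find('div.openTime')->each) {
        my $days  = $slot->at('div.days');     # first matching descendant
        my $hours = $slot->at('span.hours');
        next unless $days && $hours;           # skip incomplete blocks
        my @pair = map { my $t = $_->text; $t =~ s/^\s+|\s+$//g; $t }
                   ($days, $hours);
        push @pairs, \@pair;
    }
    return @pairs;
}

# Shortened sample input (assumption: same structure as the question's HTML).
my $html = <<'HTML';
<div class="hours">
  <div class="openTime">
    <div class="days">Fri</div>
    <span class="hours"> 10:00 AM–9:00 PM</span>
  </div>
  <div class="openTime">
    <div class="days">Sun</div>
    <span class="hours"> 10:00 AM–5:00 PM</span>
  </div>
</div>
HTML

say "$_->[0]: $_->[1]" for open_times($html);
```

Looking up both elements inside the same openTime block keeps each day list tied to its own hours, even if some block were missing one of the two.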

Edit: Daxim has posted the analogous Web::Query code as a comment, so I will repost it here for better formatting. I haven't tried it, but I trust his code generally. Assuming the HTML is in a variable $html:

use Web::Query qw(); 
my $w = Web::Query->new_from_html($html);
say "div days:";
say for $w->find('div.days')->text; 
say "\nspan hours:"; 
say for $w->find('span.hours')->text; 
say "Open Times:"; 
$w->find('div.openTime')->each(sub { say for $_->find('*')->text });

For comparison, Web::Query's API is less verbose: `use Web::Query qw(); my $w = Web::Query->new_from_html($html); say for $w->find('div.days')->text; say "\nspan hours:"; say for $w->find('span.hours')->text; say "Open Times:"; $w->find('div.openTime')->each(sub { say for $_->find('*')->text });` — daxim, May 20 '12 at 21:18

score 3 · Answer 2 · edited May 20 '12 at 21:10

Use modules specific to this task: HTML::Parser, HTML::Tree and the like.
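
As a rough illustration of what this could look like, here is a sketch using HTML::TreeBuilder, the tree-building class from the HTML::Tree distribution. It is not taken from the answer itself: the helper name `texts_for` and the shortened sample HTML are assumptions made for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;              # the sample HTML below contains an en dash
use feature 'say';
use HTML::TreeBuilder;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: trimmed text of every element with the given tag
# name whose class attribute is exactly $class.
sub texts_for {
    my ($html, $tag, $class) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @texts = map { $_->as_trimmed_text }
                $tree->look_down(_tag => $tag, class => $class);
    $tree->delete;     # release the tree explicitly
    return @texts;
}

# Shortened sample input (assumption: same structure as the question's HTML).
my $html = <<'HTML';
<div class="hours">
  <div class="openTime">
    <div class="days">Fri</div>
    <span class="hours"> 10:00 AM–9:00 PM</span>
  </div>
  <div class="openTime">
    <div class="days">Sun</div>
    <span class="hours"> 10:00 AM–5:00 PM</span>
  </div>
</div>
HTML

say "div days:";
say for texts_for($html, 'div', 'days');

say "\nspan hours:";
say for texts_for($html, 'span', 'hours');
```

Note that a plain string criterion in `look_down` compares the whole attribute value, so an element with `class="days weekend"` would not match `class => 'days'`; a code reference or regular expression criterion would be needed for that case.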

Oleg V. Volkov · answered May 20 '12 at 14:39 · edited by daxim May 20 '12 at 21:10

score -1 · Answer 3 · answered May 20 '12 at 14:40

Regular expression to match the status ("Currently closed"), allowing for whitespace between the closing h5 tag and the following div:

/<\/h5>\s*<div[^>]*>([^<]*)/

To match days:

/<div class="days">([^<]*)/

To match hours:

/<span class="hours">([^<]*)/
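
For completeness, here is a minimal sketch of how such patterns might be applied to collect every match at once. It assumes the HTML is in a variable `$html` (a shortened sample is used here), and the added `\s*` in the hours pattern is my own tweak to skip the leading space. Patterns like these depend on the exact attribute order, quoting, and nesting of the markup, so they will stop matching if the page changes even slightly.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;              # the sample HTML below contains an en dash
use feature 'say';

binmode STDOUT, ':encoding(UTF-8)';

# Shortened sample input (assumption: same structure as the question's HTML).
my $html = <<'HTML';
<div class="hours">
  <div class="openTime">
    <div class="days">Fri</div>
    <span class="hours"> 10:00 AM–9:00 PM</span>
  </div>
  <div class="openTime">
    <div class="days">Sun</div>
    <span class="hours"> 10:00 AM–5:00 PM</span>
  </div>
</div>
HTML

# In list context, a match with /g returns the capture from every match.
my @days  = $html =~ /<div class="days">([^<]*)/g;
my @hours = $html =~ /<span class="hours">\s*([^<]*)/g;

# Pair the two lists by position; this relies on both having equal length.
say "$days[$_]: $hours[$_]" for 0 .. $#days;
```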

Sergej Brazdeikis · answered May 20 '12 at 14:40

Why did you downvote my solution? It is faster than using DOM libraries. Just check the memory usage and the time the script takes to execute. – Sergej Brazdeikis May 20 '12 at 21:22

LOL I didn't downvote anyone, I know this will work. Good job and thanks. – Ivan Wang May 21 '12 at 01:23

Maybe because the question specifically mentions that he's aware of regexps and wants to hear about OTHER options? – Oleg V. Volkov May 21 '12 at 09:12

Downvoted for violation of http://stackoverflow.com/a/1732454/9719 . Regexes are a bad solution. – darch May 21 '12 at 13:13

That post has no valid reasons explained. "Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers..." What???? :D OK, if this sounds like a reason, read this -> "Chuck Norris can parse HTML with regex." – Sergej Brazdeikis May 21 '12 at 18:31

Thanks Oleg, I got the point: the answer couldn't be in regexp. My bad :) – Sergej Brazdeikis May 21 '12 at 18:44
