
I'm having difficulty scraping dates from a specific web page because each date is apparently an argument passed to a JavaScript function. I have written a few simple scrapers in the past without any major issues, so I didn't expect problems, but I am struggling with this one. The page has 5-6 dates in regular yyyy/mm/dd format, like this: dateFormat('2012/02/07')

Ideally I would like to remove everything except the half-dozen dates, which I want to save in an array. At this point, I can't even successfully get one date, let alone all of them. It is probably just a malformed regex that I have been looking at for so long that I can't spot the mistake any more.

Q1. Why am I not getting a match with the regex below?

Q2. Following on from the above question, how can I scrape all the dates into an array? I was thinking of assuming x number of dates on the page, for-looping x times and assigning the captured group to an array on each loop, but that seems rather clunky.

Problem code follows.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;

my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";
my $content = get($url_full);
#dateFormat('2012/02/07');
$content =~ s/.*dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\);.*/$1/; # get any date without regard to greediness etc
Mat
SlowLearner

1 Answer


Why do you have two whitespace characters in your pattern?

$content =~ s/.*dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\);.*/$1/;
                                                 ^^^^^

They are not in your format example, dateFormat('2012/02/07').

I would say this is the reason why your pattern does not match.
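As a quick check, here is a minimal sketch that runs both variants of the pattern against a single line. The sample string is made up for illustration (only the dateFormat('2012/02/07') part comes from the question), so the real page markup may well look different.

```perl
use strict;
use warnings;

# Hypothetical sample line; only the dateFormat('...') call is taken from the question.
my $sample = "document.write(dateFormat('2012/02/07'));";

# Original pattern: demands two whitespace characters between the day and the quote.
my $with_ws = ($sample =~ /dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\)/) ? 1 : 0;

# Same pattern with the stray \s{2} removed; list context returns the captured group.
my ($date) = $sample =~ /dateFormat\('(\d{4}\/\d{2}\/\d{2})'\)/;

print 'with \s{2}:    ', ($with_ws ? 'match' : 'no match'), "\n";
print 'without \s{2}: ', (defined $date ? $date : 'no match'), "\n";
```

On this sample, only the second pattern should find the date, since nothing sits between the day and the closing quote.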

Capture all dates

You can get all matches into an array like this:

( my @Result ) = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;

(?<=dateFormat\(') is a positive lookbehind assertion. It ensures that the literal text dateFormat(' comes immediately before your date pattern, without including it in the match.

(?='\)) is a positive lookahead assertion. It ensures that ') comes immediately after the date, again without including it in the match.

The g modifier lets your pattern search for all matches in the string; in list context, the match returns all of them.
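If lookarounds feel unfamiliar, a plain capture group should give the same result, because a /g match in list context returns every captured group. This also sidesteps the s/// approach from the question: without the /s modifier, . does not match a newline, so that substitution only rewrites the one line containing the match and leaves the rest of the page in place. The sketch below compares both forms; the here-document is an assumed stand-in for the downloaded page, not the real markup.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the downloaded page; the real markup may differ.
my $content = <<'END_HTML';
<td><script>dateFormat('2012/02/07');</script></td>
<td><script>dateFormat('2012/02/06');</script></td>
<td><script>dateFormat('2012/02/03');</script></td>
END_HTML

# Lookaround version from the answer: each match is the bare date.
my @lookaround = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;

# Capture-group version: in list context, /g returns every captured group.
my @captured = $content =~ /dateFormat\('(\d{4}\/\d{2}\/\d{2})'\)/g;

print "$_\n" for @captured;
```

Either array can then be used directly, with no need to guess the number of dates in advance or loop a fixed number of times.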

stema
  • Thanks, that's certainly part of it - I have been staring at that for half an hour without noticing the remains of an earlier experiment! But even without that whitespace, `s/.*dateFormat\('(\d{4}\/\d{2}\/\d{2})'\);.*/$1/` doesn't capture what I want. – SlowLearner Feb 08 '12 at 09:00
  • @SlowLearner I completed my answer – stema Feb 08 '12 at 09:06
  • Thank you: it works and it's very simple once you know how (as these things so often are with perl...) I was aware of the existence of the look ahead/behind operators but I had never used them before today. So that's progress. – SlowLearner Feb 08 '12 at 09:42