I have run into another problem in relation to a site I am trying to scrape.

I have stripped most of what I don't want from the page content and, thanks to some help given here, have managed to isolate the dates I wanted. Most of it seems to be working, despite some initial problems matching a non-breaking space. However, I am now having difficulty with the final regex, which is intended to split each line of data into fields. Each line holds the value of one share price index. The fields on each line are:

  1. A name of arbitrary length made from characters from the latin alphabet and sometimes a comma or ampersand, no numerics.
  2. A number with two digits after the decimal point (the absolute value of the index).
  3. A number with two digits after the decimal point (the change in the value).
  4. A number with two digits after the decimal point followed by a percent sign (the percentage change in value).

Here is an example string, before splitting: "Fishery, Agriculture & Forestry243.45-1.91-0.78% Mining360.74-4.15-1.14% Construction465.36-1.01-0.22% Foods783.2511.281.46% Textiles & Apparels412.070.540.13% Pulp & Paper333.31-0.29-0.09% Chemicals729.406.010.83% "

The regex I am using to split this line is this:

$mystr =~ s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;

It works sometimes but not other times and I cannot work out why this should be. (The doubled equal signs in the example output below are used to make the field split more easily visible.)

Fishery, Agriculture & Forestry == 243.45 == -1.91 == -0.78%
Mining360.74-4.15-1.14%
Construction == 465.36 == -1.01 == -0.22%
Foods783.2511.281.46%

I thought the minus sign was an issue for those indices that saw a negative change in the price of the index, but sometimes it works despite the minus sign.

Q. Why is the final regex shown below failing to split the fields consistently?

Example code follows.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;

my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";

my $content = get($url_full);
# get dates:
(my @dates) = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;
foreach my $date (@dates) { # convert to yyyy-mm-dd
    $date =~ s/\//-/ig;
}
my $tree = HTML::Tree->new();
$tree->parse($content);
my $mystr = $tree->as_text;

$mystr =~ s/\xA0//gi; # remove non-breaking spaces
# remove first chunk of text:
$mystr =~
  s/^(TSE.*?)IndustryIndexChange ?/IndustryIndexChange\n$dates[0]\n\n/gi;
$mystr =~ s/IndustryIndexChange ?/IndustryIndexChange/ig;
$mystr =~ s/IndustryIndexChange/Industry Index Change\n/ig;
$mystr =~ s/% /%\n/gi; # percent symbol is marker for end of line
# indicate breaks between days:
$mystr =~ s/Stock.*?IndustryIndexChange/\nDAY DELIMITER\n/gi;
$mystr =~ s/Exemption from Liability.*$//g; # remove boilerplate at bottom

# and here's the problem regex...
# try to split it:
$mystr =~
  s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;

print $mystr;
SlowLearner
  • I assume the number columns start out life with something between them. But by the time we get to extracting them, all the numbers are jammed together with only (hopefully) fixed formatting to help us tease them apart. Wouldn't things be a lot easier if you left the separating characters in? – zgpmax Feb 08 '12 at 12:47
  • A fair point. The figures were originally in 5 different tables so it was either try and parse/save each table or dump the text with HTML::Tree and I chose the latter. As the fields are quite regular I didn't think it would be a problem and in theory I still don't think it should be a problem. In practice, however... – SlowLearner Feb 08 '12 at 12:52

3 Answers

It appears to be doing every other one.

My guess is that your records have a single \n between them, but your pattern starts and ends with a \n. So the final \n on the first match consumes the \n that the second match needed to find the second record. The net result is that it picks up every other record.

You might be better off wrapping your pattern in ^ and $ (instead of \n and \n), and using the m flag on the s///.
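
A minimal sketch of that change, assuming the records are already one per line. The sample lines are copied from the question, and the capture groups are the question's own; only the two \n anchors have been swapped for ^ and $ with the m flag added.

```perl
use strict;
use warnings;

# Sample records copied from the question, one per line.
my $mystr = join "\n",
    'Fishery, Agriculture & Forestry243.45-1.91-0.78%',
    'Mining360.74-4.15-1.14%',
    'Construction465.36-1.01-0.22%',
    'Foods783.2511.281.46%';

# ^ and $ are zero-width, so with /m every line can match on its own:
# no newline is consumed, and none is needed by the next match.
$mystr =~ s/^(.*?)(\d{1,4}\.\d{2})(-?\d{1,3}\.\d{2})(.*?%)$/$1 == $2 == $3 == $4/mg;

print "$mystr\n";
```

With the anchors in place, all four sample lines are split, including the Mining and Foods lines that were skipped before.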

zgpmax
  • Thank you, infuriatingly obvious now that I look at it. There must be some "law of regexes" that states that the likelihood of one solving a regex problem is inversely proportional to the time spent looking at the regex. – SlowLearner Feb 08 '12 at 13:04

The problem is that you have \n both at the start and at the end of the regex.

Consider something like this:

$s = 'abababa';
$s =~ s/aba/axa/g;

that will set $s to axabaxa, not axaxaxa, because there are only two non-overlapping occurrences of aba.
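
For comparison, here is a small sketch of the same substitution next to a variant that moves the trailing a into a lookahead. The lookahead version is an illustration added here, not part of the original example; it shows that a character which is only checked, not consumed, stays available to the next match.

```perl
use strict;
use warnings;

# The example above: each match consumes its trailing "a",
# so the "aba" in the middle is never seen.
my $consumed = 'abababa';
$consumed =~ s/aba/axa/g;

# Variant: the trailing "a" is tested by a lookahead and not consumed,
# so it can also serve as the leading "a" of the next match.
my $lookahead = 'abababa';
$lookahead =~ s/ab(?=a)/ax/g;

print "$consumed\n";    # axabaxa
print "$lookahead\n";   # axaxaxa
```

The same reasoning applies to the \n at each end of the pattern in the question: the newline that ends one match is the one the next record needed at its start.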

ruakh

My interpretation (pseudocode):

one   = [a-zA-Z,& ]+
two   = \d{1,4}\.\d\d
three = -?<<two>>
four  = -?<<two>>%

regex = (<<one>>)(<<two>>)(<<three>>)(<<four>>)
      = ([a-zA-Z,& ]+)(\d{1,4}\.\d\d)(-?\d{1,4}\.\d\d)(-?\d{1,4}\.\d\d%)
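
A rough Perl sketch of this pattern, applied to part of the sample string from the question. It assumes the decimal points are escaped and that the change and percentage fields may carry a leading minus sign, since the sample data contains negative changes. The names $raw, $num and @rows are made up for the illustration.

```perl
use strict;
use warnings;

# Sample text copied from the question (records still run together).
my $raw = 'Fishery, Agriculture & Forestry243.45-1.91-0.78% '
        . 'Mining360.74-4.15-1.14% Foods783.2511.281.46% ';

my $num = qr/-?\d{1,4}\.\d\d/;   # optionally signed, exactly two decimals

my @rows;
# \s* skips the space left between records; the name class contains no
# digits, so the name stops where the first number begins.
while ($raw =~ /\s*([A-Za-z,& ]+)(\d{1,4}\.\d\d)(${num})(${num}%)/g) {
    push @rows, [$1, $2, $3, $4];
}

print join(' | ', @$_), "\n" for @rows;
```

Matching in a loop with the g flag avoids any dependence on newlines between records. Note that adjacent numbers such as 783.2511.281.46% can only be separated because every field has exactly two digits after the decimal point.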

However, you are already presented with 'structured' data in the form of HTML. Why not take advantage of this?

The question "HTML parsing in perl" references Mojo for DOM-based parsing in Perl, and unless there are serious performance reasons, I'd highly recommend such an approach.
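
As a rough idea of what that could look like, here is a sketch using Mojo::DOM from the Mojolicious distribution (assuming that is the module meant, and that it is installed). The markup below is hypothetical: the real page's structure is not shown in the question, so the tags and the selectors would need adjusting.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup; the real page's table layout is an assumption.
my $html = <<'HTML';
<table>
  <tr><th>Industry</th><th>Index</th><th>Change</th><th>%</th></tr>
  <tr><td>Mining</td><td>360.74</td><td>-4.15</td><td>-1.14%</td></tr>
  <tr><td>Foods</td><td>783.25</td><td>11.28</td><td>1.46%</td></tr>
</table>
HTML

my $dom = Mojo::DOM->new($html);

my @rows;
for my $tr ($dom->find('tr')->each) {
    # Collect the text of each data cell in this row.
    my @cells = $tr->find('td')->map('text')->each;
    push @rows, \@cells if @cells;   # the header row has no <td> cells
}

print join(' == ', @$_), "\n" for @rows;
```

Because the cells are read individually, the fields never get jammed together and no regex is needed to pull them apart again.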

Lyndon Maydwell