1

I am trying to extract the content of a date element from many ill-formed SGML documents. For instance, a document can contain a simple date element like

<DATE>4th July 1936</DATE>

or

<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>

but it can also be as hairy as:

<DATE blaAttrib="89787adjd98d9">4th July 1936
<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>

The aim is to get the "4th July 1936". Since the files are not big, I chose to read the whole content into a variable and do the regex. The following is the snippet of my Perl code:

{
    local $/ = undef;
    open FILE, "$file" or die "Couldn't open file: $!";
    $fileContent = <FILE>;
    close FILE;

    if ( $fileContent =~ m/<DATE(.*)>(.*)<\/DATE>/)
    {
        # $2 should contain the "4th July 1936" but it did not.
    }
}

Unfortunately the regex does not work for the hairy example. This is because inside the <DATE> there is an <EM> element and it also spans multiple lines.

Can any kind soul give me some pointers, directions, or clues?

Thanks heaps!

Svante
Gilbeg
  • 1
    [Friends don't let friends parse HTML with regexes!](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) Use a parser. – Ether Jul 27 '10 at 14:56
  • [Don't Parse HTML with Regexs use HTML::Parser](http://perldoc.perl.org/perlfaq6.html#How-do-I-match-XML%2c-HTML%2c-or-other-nasty%2c-ugly-things-with-a-regex%3f) Also use [3 argument open and lexical filehandles](http://perldoc.perl.org/functions/open.html) – xenoterracide Jul 27 '10 at 17:56
  • Also, a question... you say ill-formed... do you mean /not/ well-formed? Meaning something like happens? – xenoterracide Jul 27 '10 at 17:58

7 Answers

4

Use an XML parser if you can.

But judging from your example, you could probably try

if ($fileContent =~ m/<DATE[^>]*>([^<]+)/) {
  # use $1 here
  # you may need to strip new lines
}
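To illustrate why the negated character classes help, here is a minimal sketch. The helper name `extract_date` and the whitespace trimming are assumptions added for illustration, not part of the answer above. `[^>]*` skips any attributes, and `[^<]+` captures text (newlines included) up to the next tag, so no `/s` modifier is needed.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the text that
# follows the first <DATE ...> start tag, up to the next '<', trimmed.
sub extract_date {
    my ($content) = @_;
    # [^>]* skips any attributes; [^<]+ captures text up to the next tag.
    if ( $content =~ m/<DATE[^>]*>([^<]+)/ ) {
        my $date = $1;
        $date =~ s/^\s+|\s+$//g;    # strip surrounding whitespace and newlines
        return $date;
    }
    return;    # no DATE element found
}

my $hairy = qq{<DATE blaAttrib="89787adjd98d9">4th July 1936\n}
          . qq{<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>};

print extract_date($hairy), "\n";    # prints "4th July 1936"
```

Note that this only sees the text before the first nested tag, which is what the question asks for; it would miss a date that is itself wrapped in another element.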
Community
kennytm
  • 1
    Hi Ken. Thanks for the regex, it certainly worked. The reason I did not use an XML parser is that there are about 20,000 SGML files I need to check, each about 50K in size. Parsing them all would, I think, be overkill and slow. I might be able to use a SAX-based parser, but I am not a Perl expert, so I am just trying to get this task done asap and move on. – Gilbeg Jul 28 '10 at 00:36
3

Use an HTML parser.

Use an HTML parser.

Please, use an HTML parser.

But for a regex, I'd try

<DATE(.*?)>(.*)<\/DATE>

which should be faster than KennyTM's alternative... By the way, why are you capturing that second group?

Ether
MvanGeest
  • Ah, I hadn't noticed that. Still, there are some very resilient parsers around that can handle a huge mess. – MvanGeest Jul 27 '10 at 14:05
  • There are HTML parsers that would do this job nicely. – Ether Jul 27 '10 at 14:56
  • Thanks Ether. I'm still getting used to the idea that users can edit my answers, though. (I knew that when I signed up, but I always wondered how often it would happen, and why. Well, here's a legitimate reason.) – MvanGeest Jul 27 '10 at 17:15
  • I have about 20K SGML files and I just want to check their dates. If I have to parse them, say using SGML::Parser, it would be overkill and slow, unless I use a SAX-based parser. BTW, your regex indeed worked. Thanks! – Gilbeg Jul 28 '10 at 00:40
3

If the date format is fixed, you might want to use something like this:

m/<DATE(.*)>([0-9]+(st|nd|rd|th)\s(January|February|March|April|May|June|July|August|September|October|November|December)\s[0-9]+)(.*)<\/DATE>/
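As written, the trailing `(.*)<\/DATE>` cannot cross a newline without the `/s` modifier, so the pattern would miss the two-line example. A possible variant, sketched below under the assumption that only the date text is needed, drops the end tag, replaces `.*` with `[^>]*`, and uses `/x` for readability. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# Month names taken from the pattern above.
my $months = join '|', qw(January February March April May June July
                          August September October November December);

# /x allows whitespace and comments inside the pattern.
my $date_re = qr{
    <DATE [^>]* >                 # start tag with optional attributes
    \s*
    (                             # group 1: the date itself
        [0-9]+ (?:st|nd|rd|th)    # day with ordinal suffix
        \s+ (?:$months)           # month name
        \s+ [0-9]+                # year
    )
}x;

my $hairy = qq{<DATE blaAttrib="89787adjd98d9">4th July 1936\n}
          . qq{<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>};

print "$1\n" if $hairy =~ $date_re;    # prints "4th July 1936"
```

A side benefit of a strict format is that a DATE element holding something other than a date simply fails to match, which makes bad files easy to flag.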
Karel Petranek
3

Instead of matching .*, you should match "everything that is not an angle bracket",

i.e.:

if ($string =~ /^<DATE[^>]*>([^<]+)</) {
    # the leading ^ assumes $string starts with the DATE tag
}

There, $1 is your date.

benzebuth
2

You should use non-greedy matching and the /s modifier to make . match newlines.

my @l = (
'<DATE>4th July 1936</DATE>',
'<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>',
'<DATE blaAttrib="89787adjd98d9">4th July 1936
<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>'
);

foreach (@l) {
  # /s lets . match newlines; \s*< keeps the trailing newline out of $1
  /^<DATE.*?>(.*?)\s*</s && print "$1\n";
}

output:

4th July 1936
4th July 1936
4th July 1936
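To see why non-greedy matching matters once a file holds more than one DATE element, here is a small sketch. The sample content and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical content with two DATE elements, the first spanning two lines.
my $content = qq{<DATE>4th July 1936\n<EM>note</EM></DATE>\n}
            . qq{<P>text</P>\n<DATE>1st May 1940</DATE>\n};

# Greedy: (.*) runs to the LAST </DATE>, swallowing everything in between.
my ($greedy) = $content =~ m/<DATE[^>]*>(.*)<\/DATE>/s;

# Non-greedy: (.*?) stops at the FIRST </DATE>.
my ($lazy) = $content =~ m/<DATE[^>]*>(.*?)<\/DATE>/s;

print "greedy captured ", length($greedy), " characters\n";
print "lazy captured ",   length($lazy),   " characters\n";
```

Here the greedy capture runs through the second element's date, while the non-greedy capture stays inside the first element.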
Toto
0

Even your "hairy" example can be reduced to a similar type. If you are always going to have 1) the actual date on the same line as the start tag--and 2) that's all you want--it doesn't matter where the end tag is.

$fileContent =~ m/<DATE([^>]*)>\s*(\d+\p{Alpha}+\s+\p{Alpha}+\s+\d{4})/

will work as long as those two assumptions hold. (If '>' will never appear inside the tag, use [^>]* instead of .*: a greedy .* eats up your entire line, causes the expression to fail, and then has to give back and check, give back and check, ...)

Axeman
-4

There is not any way to use a regex over multiple lines, but you can use a little trick. If the files aren't too big, as you have mentioned, you can first replace all '\n' characters with some value (NEW_LINE or something like that), or you can delete them, and then use your pattern.

Klark
  • 4
    There is. He's doing `local $/ = undef;` which does just that (well, it reads the whole file at once). Read up on Perl regexes in `perldoc perlre`. – MvanGeest Jul 27 '10 at 13:13
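The point of that comment can be checked with a short sketch. It uses an in-memory filehandle as a stand-in for a real file, plus the three-argument `open` and lexical filehandle suggested in the question comments. The sample data and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Stand-in for a file on disk: an in-memory filehandle over a string.
my $data = qq{<DOC>\n<DATE blaAttrib="89787adjd98d9">4th July 1936\n}
         . qq{<EM>more text</EM></DATE>\n</DOC>\n};

my $fileContent = do {
    local $/;    # undef input record separator: read everything at once
    open my $fh, '<', \$data or die "Couldn't open: $!";
    <$fh>;
};

# Without /s, '.' stops at the newline, so this fails on the two-line element.
my $without_s = $fileContent =~ m/<DATE[^>]*>(.*?)<\/DATE>/  ? 1 : 0;

# With /s, '.' also matches "\n", so the same pattern succeeds.
my $with_s    = $fileContent =~ m/<DATE[^>]*>(.*?)<\/DATE>/s ? 1 : 0;

print "without /s: $without_s, with /s: $with_s\n";    # prints "without /s: 0, with /s: 1"
```

So the whole file is already in one string; the only thing stopping the original pattern is that `.` does not match a newline by default.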