Perl RegEx: Limiting the pattern to only the first occurrence of a character

Question

I am trying to extract the content of a date element from many ill-formed SGML documents. For instance, a document can contain a simple date element like

<DATE>4th July 1936</DATE>

or

<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>

but it can also be as hairy as:

<DATE blaAttrib="89787adjd98d9">4th July 1936
<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>

The aim is to get the "4th July 1936". Since the files are not big, I chose to read the whole content into a variable and run the regex on it. The following is a snippet of my Perl code:

{
    local $/ = undef;
    open FILE, "$file" or die "Couldn't open file: $!";
    $fileContent = <FILE>;
    close FILE;

    if ( $fileContent =~ m/<DATE(.*)>(.*)<\/DATE>/)
    {
        # $2 should contain the "4th July 1936" but it did not.
    }
}

Unfortunately, the regex does not work for the hairy example, because the <DATE> element contains an <EM> element and also spans multiple lines.

Can any kind soul give me some pointers, directions, or clues?

Thanks heaps!

[Friends don't let friends parse HTML with regexes!](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) Use a parser. — Ether, Jul 27 '10 at 14:56
[Don't parse HTML with regexes; use HTML::Parser](http://perldoc.perl.org/perlfaq6.html#How-do-I-match-XML%2c-HTML%2c-or-other-nasty%2c-ugly-things-with-a-regex%3f). Also use [3-argument open and lexical filehandles](http://perldoc.perl.org/functions/open.html). – xenoterracide, Jul 27 '10 at 17:56
Also, a question: you say ill-formed. Do you mean /not/ well-formed, meaning something like happens? – xenoterracide, Jul 27 '10 at 17:58

score 4 · Accepted Answer · edited May 23 '17 at 11:47

Use an XML parser if you can.

But judging from your example, you could probably try:

if ($fileContent =~ m/<DATE[^>]*>([^<]+)/) {
  # use $1 here
  # you may need to strip new lines
}
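
The pattern above can be tried against the three samples from the question. This is only a minimal sketch: the `extract_date` helper and the `@samples` list are made up for illustration, and trimming whitespace is just one way to handle the "strip new lines" remark. Note that negated character classes such as `[^>]` and `[^<]` match newlines on their own, so no `/s` modifier is needed here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the end of the opening
# <DATE ...> tag and the next '<', or undef when nothing matches.
sub extract_date {
    my ($content) = @_;
    return unless $content =~ m/<DATE[^>]*>([^<]+)/;
    my $date = $1;
    $date =~ s/^\s+|\s+$//g;    # trim surrounding whitespace, including newlines
    return $date;
}

# The three shapes of DATE element shown in the question.
my @samples = (
    '<DATE>4th July 1936</DATE>',
    '<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>',
    "<DATE blaAttrib=\"89787adjd98d9\">4th July 1936\n"
        . "<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>",
);

print extract_date($_), "\n" for @samples;
```

Each of the three samples should print `4th July 1936`.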

answered Jul 27 '10 at 13:07 by kennytm · edited May 23 '17 at 11:47 by Community

Hi Ken. Thanks for the regex, it certainly worked. The reason I did not use an XML parser is that there are about 20,000 SGML files I need to check, each about 50K in size. Parsing them all seems like overkill and would be slow. I might be able to use a SAX-based parser, but I am not a Perl expert, so I am just trying to get this task done ASAP and move on. – Gilbeg Jul 28 '10 at 00:36

score 3 · Answer 2 · edited Jul 27 '10 at 16:06

Use an HTML parser.

Please, use an HTML parser.

But for a regex, I'd try

<DATE(.*?)>(.*)<\/DATE>

which should be faster than KennyTM's alternative... By the way, why are you capturing that second group?

answered Jul 27 '10 at 13:12 by MvanGeest · edited Jul 27 '10 at 16:06 by Ether


Ah, I hadn't noticed that. Still, there are some very resilient parsers around that can handle a huge mess. – MvanGeest Jul 27 '10 at 14:05
There are HTML parsers that would do this job nicely. – Ether Jul 27 '10 at 14:56
Thanks Ether. I'm still getting used to the idea that users can edit my answers, though. (I knew that when I signed up, but I always wondered how often it would happen, and why. Well, here's a legitimate reason.) – MvanGeest Jul 27 '10 at 17:15
I have about 20K SGML files and I just want to check their dates. Parsing them with, say, SGML::Parser would be overkill and slow, unless I used a SAX-based parser. BTW, your regex indeed worked. Thanks! – Gilbeg Jul 28 '10 at 00:40

score 3 · Answer 3 · answered Jul 27 '10 at 13:12

If the date format is fixed, you might want to use something like this:

m/<DATE(.*)>([0-9]+(st|nd|rd|th)\s(January|February|March|April|May|June|July|August|September|October|November|December)\s[0-9]+)(.*)<\/DATE>/
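
As written, `.` does not match a newline, so the pattern above would not cover the multi-line sample from the question unless the `/s` modifier is added. The following is a sketch of one possible variant, not the answer's exact pattern: it swaps `(.*)` for `[^>]*` inside the tag, drops the requirement that `</DATE>` follows, and keeps the same fixed date format. The names `extract_fixed_date` and `$month` are made up for illustration.

```perl
use strict;
use warnings;

# Month names from the pattern above, precompiled so the main regex stays short.
my $month = qr/January|February|March|April|May|June|July|August|September|October|November|December/;

# Hypothetical helper: returns a date such as "4th July 1936" that directly
# follows the opening <DATE ...> tag, or undef when the format does not match.
sub extract_fixed_date {
    my ($content) = @_;
    if ($content =~ m{<DATE[^>]*>\s*(\d+(?:st|nd|rd|th)\s+$month\s+\d+)}) {
        return $1;
    }
    return;
}

my $hairy = "<DATE blaAttrib=\"89787adjd98d9\">4th July 1936\n"
          . "<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>";
print extract_fixed_date($hairy), "\n";
```

Because the capture stops right after the year, it does not matter whether the closing tag is on the same line or several lines later.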

answered Jul 27 '10 at 13:12 by Karel Petranek

score 3 · Answer 4 · answered Jul 27 '10 at 13:29

Instead of matching .*, you should match "everything that is not an angle bracket",

i.e.:

 if ($string =~ /<DATE[^>]*>([^<]+)</) {
     # use the date here
 }

There, $1 is your date.

answered Jul 27 '10 at 13:29 by benzebuth · edited Jul 27 '10 at 13:37

score 2 · Answer 5 · answered Jul 27 '10 at 13:57

You should use non greedy matching and the modifier s to make . match newline

my @l = (
'<DATE>4th July 1936</DATE>',
'<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>',
'<DATE blaAttrib="89787adjd98d9">4th July 1936
<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>'
);

foreach(@l) {
  /^<DATE.*?>(.*?)\s*</s && print "$1\n";
}

output:

4th July 1936
4th July 1936
4th July 1936

score 0 · Answer 6 · answered Jul 27 '10 at 18:45

Even your "hairy" example can be reduced to a similar type. If you are always going to have 1) the actual date on the same line as the start tag--and 2) that's all you want--it doesn't matter where the end tag is.

$fileContent =~ m/<DATE([^>]*)>\s*(\d+\p{Alpha}+\s+\p{Alpha}+\s+\d{4})/

will work in those cases. (If you do not expect '>' inside the tag, it is a good idea to use [^>]* rather than .*, which eats up your entire line, makes the expression fail, and then has to give back and check, give back and check, ...)

score -4 · Answer 7 · answered Jul 27 '10 at 13:12

There is not any way to use regex over multiple lines, but you can use a little trick. If the files aren't too big, as you have mentioned, you can first replace all '\n' characters with some value (NEW_LINE or something like that), or you can delete them and then use your pattern.

answered Jul 27 '10 at 13:12 by Klark

There is. He's doing `local $/ = undef;` which does just that (well, it reads the whole file at once). Read up on Perl regexes in `perldoc perlre`. – MvanGeest Jul 27 '10 at 13:13
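
To illustrate the point in the comment above, here is a rough sketch of slurping a whole file and matching across line breaks with the `/s` modifier. It also follows the earlier suggestion to use three-argument `open` and a lexical filehandle. The `slurp` helper and the sample text are made up, and an in-memory filehandle stands in for a real SGML file so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from an already opened filehandle.
sub slurp {
    my ($fh) = @_;
    local $/ = undef;    # no record separator: one read returns the whole file
    return scalar <$fh>;
}

# Made-up sample content; a real script would open the SGML file by name.
my $sgml = "<DOC>\n<DATE blaAttrib=\"89787adjd98d9\">4th July 1936\n"
         . "<EM>spanned across multiple lines</EM></DATE>\n</DOC>\n";

# Three-argument open with a lexical filehandle, reading from a string.
open my $fh, '<', \$sgml or die "Couldn't open in-memory file: $!";
my $file_content = slurp($fh);
close $fh;

# /s lets . match newlines, so a single match can run across line breaks.
if ($file_content =~ m/<DATE.*?>(.*?)<\/DATE>/s) {
    my $inner = $1;
    print "matched across lines: ", ($inner =~ /\n/ ? "yes" : "no"), "\n";
}
```

Here the captured text includes the newline and the `<EM>` element, which shows that the regex engine has no trouble with multi-line input once the whole file is in one string.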
