I am having real problems trying to split a string at its opening HTML tag. I have the following Perl script, which I am using to test:

#!/usr/bin/perl

my $text = '<html xmlns:v=3D"urn:schemas-microsoft-com:vml" xmlns:o=3D"urn:schemas-    micr=osoft-com:office:office" xmlns:w=3D"urn:schemas-microsoft-com:office:word" =xmlns:m=3D"http://schemas.microsoft.com/office/2004/12/omml" xmlns=3D"http:=//www.w3.org  /TR/REC-html40"><head><META HTTP-EQUIV=3D"Content-Type" CONTENT==3D"text/html; charset=3Dus-ascii"><meta name=3DGenerator content=3D"Micros=oft Word 14 (filtered medium)">This is a test</HTML>';

my $html = "Add this first";
$text =~ /(<html .*>)(.*)/i;
print $text . "\n";

What I need is for the opening <html ...> tag to be captured in $1 and everything after it in $2. Then I can add my text in between using print "$1$html$2".

I just cannot get it to work :(

UxBoD
  • [Don't](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) try to parse HTML with regex. – ObscureRobot Oct 23 '11 at 14:54

2 Answers

Rather than using .*, which is greedy and will match the closing > as well (along with everything up to the last > in the string), try [^>]*, which matches anything but a closing >.
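To make the difference concrete, here is a minimal sketch. The input is a shortened, hypothetical stand-in for the string in the question, and the variable names ($greedy_tag, $tag, $rest) are only for illustration. It assumes no attribute value contains a literal >, which is one of the cases where a regex breaks down.

```perl
use strict;
use warnings;

# Hypothetical, shortened input; the real message carries many more attributes.
my $text = '<html xmlns="http://www.w3.org/TR/REC-html40"><head><meta name="Generator">This is a test</HTML>';
my $html = 'Add this first';

# Greedy .* runs to the LAST > in the string, so the capture swallows everything.
my ($greedy_tag) = $text =~ /(<html .*>)/i;

# [^>]* stops at the FIRST >, so the first capture is only the opening tag.
# /s lets the trailing .* cross newlines if the input spans several lines.
my ($tag, $rest) = $text =~ /(<html\b[^>]*>)(.*)/is;

# Reassemble with the new text placed right after the opening tag.
print "$tag$html$rest\n";
```

Assigning the match in list context returns the captures directly, which avoids reading $1 and $2 after a match that may have failed.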

However, in general regex is not the right way to parse HTML. It just doesn't work. There are so many variations in the way that HTML is written that you'll come up against a ridiculous number of problems.

The real solution is to parse the DOM tree and find what you want that way. Try using an XML parser.

Nick Brunt

if ($subject =~ m!<html[^>]*>(.*?)</html>!) {
    $result = $1;
}

Things to note: your input starts with <html and ends with </HTML>. This cannot be.

Also, if this is the ONLY tag you are considering extracting, then you can use a regex. However, if you want to extract specific tags from inside the html/xhtml/xml etc., you should consider using one of the countless modules that are written for this job.
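If the mixed-case tags cannot be changed at the source, one possible adjustment is to add the /i and /s modifiers to the pattern. The sketch below uses a hypothetical sample string, not the exact input from the question.

```perl
use strict;
use warnings;

# Hypothetical sample that mimics the mixed-case tags in the question.
my $subject = "<html lang=\"en\"><head>\nThis is a test\n</HTML>";

my $result;
# /i lets </html> in the pattern match </HTML> in the input;
# /s lets . match newlines, so the body may span several lines.
if ($subject =~ m!<html\b[^>]*>(.*?)</html>!is) {
    $result = $1;
}

print "$result\n" if defined $result;
```

The non-greedy .*? stops at the first closing tag, which is fine here because a document has only one html element.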

FailedDev
  • Perhaps I did not explain correctly. Here is a simplified example: Some Text. I need to be able to insert my own text before "Some Text" without affecting anything else. Hence I thought I could grab the start text into $1 and everything after that into $2. Then when I output the line I would use print $1$mytext$2, which would keep everything intact. – UxBoD Oct 23 '11 at 15:37
  • If you want to insert something into something else, use lookaheads/lookbehinds to pinpoint the exact location where you want to insert your text. No need to capture and copy everything into the replacement. – FailedDev Oct 23 '11 at 15:49
  • <html>..</HTML> is perfectly legal HTML. – tadmc Oct 23 '11 at 20:02
  • @tadmc Even if it is, it's bad practice. – FailedDev Oct 23 '11 at 20:14
  • @FailedDev thanks, the lookaheads/lookbehinds appear to be the best way of achieving this. – UxBoD Oct 24 '11 at 07:50
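For reference, the insertion approach from the comments might look like the sketch below. Two caveats: the sample string and the names $after_tag and $before_text are hypothetical, and because Perl's lookbehind has traditionally had to be fixed-length, the first substitution uses \K (Perl 5.10 or later) in place of a lookbehind on the variable-length <html ...> tag.

```perl
use strict;
use warnings;

# Hypothetical sample input and insertion text.
my $text   = '<html xmlns:o="urn:example"><head>Some Text</HTML>';
my $mytext = 'Add this first';

# \K keeps everything matched so far out of the replaced span, so the
# substitution only fills the empty position right after the opening tag.
(my $after_tag = $text) =~ s/<html\b[^>]*>\K/$mytext/i;

# A lookahead matches the empty position just before a known marker,
# so the new text lands in front of "Some Text" and nothing is removed.
(my $before_text = $text) =~ s/(?=Some Text)/$mytext/;

print "$after_tag\n$before_text\n";
```

Neither substitution captures or copies the rest of the string, so everything outside the insertion point is left untouched.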