0

I have various HTML documents from which I'm trying to extract links to (1) other HTML documents and (2) image files such as .jpg, .png and .bmp. I need a regular expression to do this and cannot seem to figure it out.

Each of the HTML pages will have code similar to the following:


<IMG style="MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px" align=right src="images/sample001.jpg">

<IMG style="MARGIN-BOTTOM: 25px; MARGIN-LEFT: 25px" align=right src="images/sample002.png">

<IMG style="MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px" align=right src="images/sample003.bmp">

<a href="javascript:parent.POPUP({url:'testDoc001.htm',type:'shared',width:600,height:645})">

<a href="javascript:parent.POPUP({url:'testDoc002.html',type:'shared',width:700,height:712})">


As an example, the regular expression would operate on the above HTML and produce the following array:

images/sample001.jpg

images/sample002.png

images/sample003.bmp

testDoc001.htm

testDoc002.html

Can someone help me out? Thanks so much.

miku
Ann Sanderson

3 Answers

1

Save yourself the frustration and bugs that you'll encounter trying to parse HTML with regular expressions. Use an HTML parser like HTML Agility Pack.
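As a rough illustration of the parser-based approach, here is a minimal sketch in Python using the standard library's `html.parser` rather than HTML Agility Pack (which is a .NET library). The class name `LinkCollector`, the function `extract_links`, and the `POPUP_URL` pattern are hypothetical names chosen for this example, and the pattern assumes the popup links always have the `url:'...'` shape shown in the question.

```python
import re
from html.parser import HTMLParser

# Assumption: popup links embed the target as url:'...' inside the href value.
POPUP_URL = re.compile(r"url:\s*'([^']*)'")


class LinkCollector(HTMLParser):
    """Collects image sources and link targets as the parser walks the tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <IMG> arrives as "img".
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.links.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            href = attrs["href"]
            match = POPUP_URL.search(href)
            # Fall back to the raw href when it is not a javascript popup.
            self.links.append(match.group(1) if match else href)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = """
<IMG style="MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px" align=right src="images/sample001.jpg">
<a href="javascript:parent.POPUP({url:'testDoc001.htm',type:'shared',width:600,height:645})">doc</a>
"""
print(extract_links(sample))
```

The parser takes care of tag case, attribute order and unquoted attributes such as `align=right`; a small regex is still needed only to pull the file name out of the javascript call.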

Jim Mischel
  • While I agree that regex and HTML [seldom go together](http://stackoverflow.com/a/1732454/89391), I think something like link extraction with regular expressions is ok. – miku Apr 13 '12 at 21:10
  • @miku: My experience is that you can make it work for the small subset of things that you test it with. And then some new construct comes along and breaks it. I've found that using an HTML parser lets me get the code working sooner, the result is more reliable, and more able to cope with changing conditions. But your mileage may vary. – Jim Mischel Apr 13 '12 at 21:51
0

Maybe something along these lines (using capture groups) for the images:

IMG[^>]*src="([^"]*)"

and something like this for the popups:

url:'([^']*)'
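To show how the capture groups are meant to be read back, here is a small sketch in Python (the question itself appears to target .NET, where the equivalent would be reading the first group of each match). The function name `extract_with_regex` and the sample string are made up for this example, and the case-insensitive flag is an addition so that both `IMG` and `img` match.

```python
import re

# The two patterns from this answer; the parentheses mark the part to keep.
IMG_SRC = re.compile(r'IMG[^>]*src="([^"]*)"', re.IGNORECASE)
POPUP_URL = re.compile(r"url:'([^']*)'")


def extract_with_regex(html):
    # With exactly one group, findall returns the group text, not the whole match.
    images = IMG_SRC.findall(html)
    documents = POPUP_URL.findall(html)
    return images + documents


sample = """
<IMG style="MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px" align=right src="images/sample001.jpg">
<IMG style="MARGIN-BOTTOM: 25px; MARGIN-LEFT: 25px" align=right src="images/sample002.png">
<IMG style="MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px" align=right src="images/sample003.bmp">
<a href="javascript:parent.POPUP({url:'testDoc001.htm',type:'shared',width:600,height:645})">
<a href="javascript:parent.POPUP({url:'testDoc002.html',type:'shared',width:700,height:712})">
"""
for link in extract_with_regex(sample):
    print(link)
```

Note that the image pattern expects double quotes around the `src` value, as in the sample HTML; switching it to single quotes would make it match nothing on this input.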
miku
  • I put into my code: pattern = @"IMG[^>]*src='([^']*)'"; and didn't get anything back. I put in: pattern = @"url:'([^']*)'"; and it gave me "url:'testDoc001.htm'" and "url:'testDoc002.html'". Any ideas on how to refine this regex? – Ann Sanderson Apr 13 '12 at 20:27
  • BTW: What language are you using? – miku Apr 13 '12 at 20:28
  • I guess you'll need to set some kind of single-line/multi-line modifier – `RegexOptions.Singleline` maybe? Also, you'll need to use matching [groups](http://www.dotnetperls.com/regex-groups). – miku Apr 13 '12 at 20:40
0

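In Perl:

use strict;
use warnings;

my $x = "your html";

# $1 is the first capture group: the image path inside src="..."
# [^>]* keeps the match inside a single tag; bmp is included as the question asks
while ($x =~ /<img\b[^>]*\bsrc="([^"]+\.(?:jpg|png|bmp))"/ig) {
    print "$1\n";
}

# $1 is the quote character around the url value, $2 is the document name
while ($x =~ /<a\b[^>]*\bhref="[^>]*?url:(['"])([^'"]+\.html?)\1/ig) {
    print "$2\n";
}

Output:

images/sample001.jpg
images/sample002.png
images/sample003.bmp
testDoc001.htm
testDoc002.html

The regexps <img\b[^>]*\bsrc="([^"]+\.(?:jpg|png|bmp))" and <a\b[^>]*\bhref="[^>]*?url:(['"])([^'"]+\.html?)\1 are similar in most languages. The i flag makes the search case-insensitive, and g makes it find every match rather than only the first.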

marwinXXII