I need a Regular Expression To Extract Images And HTML Documents

Question

I have various HTML documents from which I'm trying to extract links to (1) other HTML documents and (2) image files such as .jpg, .png and .bmp. I need a regular expression to do this and can't seem to figure it out.

Each of the html pages will have code similar to the following:

<IMG style="MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px" align=right src="images/sample001.jpg">

<IMG style="MARGIN-BOTTOM: 25px; MARGIN-LEFT: 25px" align=right src="images/sample002.png">

<IMG style="MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px" align=right src="images/sample003.bmp">

href="javascript:parent.POPUP({url:'testDoc001.htm',type:'shared',width:600,height:645})">

href="javascript:parent.POPUP({url:'testDoc002.html',type:'shared',width:700,height:712})">

As an example, the regular expression would operate on the above HTML and produce the following array:

images/sample001.jpg

images/sample002.png

images/sample003.bmp

testDoc001.htm

testDoc002.html

Can someone help me out? Thanks so much.

score 1 · Accepted Answer · answered Apr 13 '12 at 20:44


Save yourself the frustration and bugs that you'll encounter trying to parse HTML with regular expressions. Use an HTML parser like HTML Agility Pack.
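
HTML Agility Pack is a .NET library. As a language-neutral illustration of the same parser-first idea, here is a minimal sketch that uses Python's standard `html.parser` module instead. The `LinkExtractor` and `extract_links` names are hypothetical, and the sketch assumes the `href` attributes in the question sit on `<a>` tags. A small regex is still used, but only on an attribute value the parser has already isolated.

```python
import re
from html.parser import HTMLParser

# Pulls the document name out of href="javascript:parent.POPUP({url:'...'})".
POPUP_URL = re.compile(r"url:'([^']*)'")


class LinkExtractor(HTMLParser):
    """Collects <img src> values and documents referenced from <a href>."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <IMG> arrives as "img".
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            self.links.append(attributes["src"])
        elif tag == "a" and attributes.get("href"):
            href = attributes["href"]
            popup = POPUP_URL.search(href)
            # Fall back to the raw href for ordinary (non-popup) links.
            self.links.append(popup.group(1) if popup else href)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<IMG style="MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px" align=right '
    'src="images/sample001.jpg">\n'
    "<a href=\"javascript:parent.POPUP({url:'testDoc001.htm',type:'shared',"
    'width:600,height:645})">doc</a>'
)
print(extract_links(sample))
```

Because the parser handles quoting, attribute order, and tag case, only the `javascript:` payload is left for a regex to deal with.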

answered Apr 13 '12 at 20:44

Jim Mischel


While I agree that regex and HTML [seldom go together](http://stackoverflow.com/a/1732454/89391), I think something like link extraction with regular expressions is ok. – miku Apr 13 '12 at 21:10
@miku: My experience is that you can make it work for the small subset of things that you test it with. And then some new construct comes along and breaks it. I've found that using an HTML parser lets me get the code working sooner, the result is more reliable, and more able to cope with changing conditions. But your mileage may vary. – Jim Mischel Apr 13 '12 at 21:51

score 0 · Answer 2 · answered Apr 13 '12 at 20:18


Maybe something along these lines (using capture groups) for the images:

IMG[^>]*src="([^"]*)"

and something like this for the popups:

url:'([^']*)'

see also: regex testing tool: http://rubular.com/r/W5aSrgMD8B
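
As a rough sketch of how these two patterns could be applied, here they are in Python (chosen only for illustration). The `extract_links` name, the `re.IGNORECASE` flag, and the `<a` prefix on the `href` lines of the sample are my assumptions, not part of the answer.

```python
import re

# The two patterns from the answer; each has exactly one capture group.
IMG_SRC = re.compile(r'IMG[^>]*src="([^"]*)"', re.IGNORECASE)
POPUP_URL = re.compile(r"url:'([^']*)'")


def extract_links(html):
    # With a single group in the pattern, findall returns the group's text,
    # not the whole match.
    return IMG_SRC.findall(html) + POPUP_URL.findall(html)


sample = """
<IMG style="MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px" align=right src="images/sample001.jpg">
<IMG style="MARGIN-BOTTOM: 25px; MARGIN-LEFT: 25px" align=right src="images/sample002.png">
<IMG style="MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px" align=right src="images/sample003.bmp">
<a href="javascript:parent.POPUP({url:'testDoc001.htm',type:'shared',width:600,height:645})">
<a href="javascript:parent.POPUP({url:'testDoc002.html',type:'shared',width:700,height:712})">
"""

for link in extract_links(sample):
    print(link)
```

Note that neither pattern filters by file extension, so any `src` or `url:` value is returned.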

answered Apr 13 '12 at 20:18

miku


I put into my code: pattern = @"IMG[^>]*src='([^']*)'"; and didn't get anything back. I put in: pattern = @"url:'([^']*)'"; and it gave me: "url:'testDoc001.htm'" and "url:'testDoc002.html'". Any ideas on how to refine this regex? – Ann Sanderson Apr 13 '12 at 20:27
BTW: What language are you using? – miku Apr 13 '12 at 20:28
I guess you'll need to set some kind of single-line/multi-line modifier – `RegexOptions.Singleline` maybe? Also you'll need to use matching [groups](http://www.dotnetperls.com/regex-groups). – miku Apr 13 '12 at 20:40
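
Both symptoms from the first comment can be reproduced in a short sketch. Python is used here only for illustration; the `@"..."` strings suggest the asker is on C#, which is my inference. The whole match includes the literal `url:'...'` text, while group 1 holds only the file name. Separately, a pattern that expects single quotes around the `src` value finds nothing in markup that uses double quotes.

```python
import re

href = "href=\"javascript:parent.POPUP({url:'testDoc001.htm',type:'shared'})\">"

match = re.search(r"url:'([^']*)'", href)
whole_match = match.group(0)   # the entire matched text, including url:'...'
file_name = match.group(1)     # only what the parentheses captured

img = 'IMG align=right src="images/sample001.jpg">'

# Single quotes in the pattern, double quotes in the markup: no match.
single_quoted = re.search(r"IMG[^>]*src='([^']*)'", img)
# Quotes in the pattern agree with the markup: group 1 is the path.
double_quoted = re.search(r'IMG[^>]*src="([^"]*)"', img)

print(whole_match)
print(file_name)
print(single_quoted)
print(double_quoted.group(1))
```

In .NET the analogous access would be `match.Groups[1].Value` rather than `match.Value`, and a double quote inside a verbatim `@"..."` string is written as `""`.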

score 0 · Answer 3 · answered Apr 13 '12 at 20:40

In Perl:

use strict;
use warnings;

my $x = "your html";

# $1 is the first capture group: the image path
while ($x =~ /<img\b[^>]*\bsrc="([^"]+\.(?:jpg|png|bmp))"/ig) {
    print "$1\n";
}

# $1 is the document name inside url:'...'
while ($x =~ /<a\b[^>]*\bhref="[^"]*url:'([^']+\.html?)'/ig) {
    print "$1\n";
}

Output:

images/sample001.jpg
images/sample002.png
images/sample003.bmp
testDoc001.htm
testDoc002.html

Both regular expressions work much the same way in most languages. Using [^>]* rather than .* keeps each match inside a single tag. The i modifier makes the match case-insensitive, and g finds every match rather than only the first.
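
The answer notes that these expressions carry over to most languages. As a rough Python sketch of that claim, `re.IGNORECASE` plays the role of `/i`, and iterating with `finditer` plays the role of looping with `/g`. The patterns below use `[^>]*` to stay inside one tag and accept `.bmp`, which the question asks for; the names `IMG_RE`, `DOC_RE`, and `extract` are hypothetical.

```python
import re

# Image paths ending in .jpg, .png or .bmp inside an <img ... src="..."> tag.
IMG_RE = re.compile(r'<img\b[^>]*\bsrc="([^"]+\.(?:jpg|png|bmp))"', re.IGNORECASE)
# .htm/.html names inside url:'...' within an <a ... href="..."> attribute.
DOC_RE = re.compile(r"""<a\b[^>]*\bhref="[^"]*url:'([^']+\.html?)'""", re.IGNORECASE)


def extract(html):
    images = [m.group(1) for m in IMG_RE.finditer(html)]
    documents = [m.group(1) for m in DOC_RE.finditer(html)]
    return images + documents


html = (
    '<IMG style="MARGIN-BOTTOM: 20px" align=right src="images/sample003.bmp">\n'
    "<a href=\"javascript:parent.POPUP({url:'testDoc001.htm',type:'shared'})\">\n"
    "<a href=\"javascript:parent.POPUP({url:'testDoc002.html',type:'shared'})\">\n"
)
print(extract(html))
```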
