Extract all images from html string using Regex

Question

I'm trying to use Regex to extract all image sources from an HTML string. For a couple of reasons I cannot use HTML Agility Pack.

I need to extract 'gfx/image.jpg' from strings which look like:

<table cellpadding="0" cellspacing="0"  border="0" style="height:350px; margin:0; background: url('gfx/image.jpg') no-repeat;">
<table cellpadding="0" cellspacing="0" border="0" background="gfx/image.jpg" style=" width:700px; height:250px; "><tr><td valign="middle">

Regex isn't the right tool to parse HTML files. What if there are special characters like `<` or `&` in the file name, or the file name appears in a comment? — phuclv, Jun 01 '21 at 08:14
Is HTML Agility Pack the best solution then? Is there any universal solution for getting image links out there? — user13657, Jun 01 '21 at 08:17
I don't know what HTML Agility Pack is but the only reliable way to parse HTML files is to use a parser. [Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms](https://stackoverflow.com/q/6751105/995714), [Using regular expressions to parse HTML: why not?](https://stackoverflow.com/q/590747/995714), [you can't parse \[x\]html with regex](https://stackoverflow.com/a/1732454/995714) — phuclv, Jun 01 '21 at 08:22
@phuclv [HTML Agility Pack](https://html-agility-pack.net/) is an HTML parser. — Andrew Morton, Jun 01 '21 at 08:48
Html Agility Pack is relatively simple to use, so it's usually recommended. If you just need the links to the images, you can also use a [WebBrowser class](https://learn.microsoft.com/en-us/dotnet/api/system.windows.forms.webbrowser) (not Control) to navigate to a URL (remote or local). When the document is loaded, you have all the images already parsed in the `[WebBrowser].Document.Images` collection. You can then download the images later or get the already downloaded files from the browser cache. — Jimi, Jun 01 '21 at 09:07

score 1 · Accepted Answer · answered Jun 01 '21 at 08:32

You can use this regex: `(['"])([^'"]+\.jpg)\1`, then read `Groups[2]` from each match. This code works fine:

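using System;
using System.Text.RegularExpressions;

var str = @"<table cellpadding=""0"" cellspacing=""0""  border=""0"" style=""height:350px; margin:0; background: url('gfx/image.jpg') no-repeat;"">
<table cellpadding=""0"" cellspacing=""0"" border=""0"" background=""gfx/image.jpg"" style="" width:700px; height:250px; ""><tr><td valign=""middle"">";
// Group 1 captures the opening quote, group 2 the path ending in .jpg;
// \1 requires the same quote character to close the value.
var regex = new Regex(@"(['""])([^'""]+\.jpg)\1");
var match = regex.Match(str);
while (match.Success)
{
    // Groups[2] is the image path without the surrounding quotes.
    Console.WriteLine(match.Groups[2].Value);
    match = match.NextMatch();
}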


youbl


If you need all image types, the regex can be changed to: `(['"])([^'"]+\.(jpg|png|bmp|gif))\1` – youbl Jun 01 '21 at 08:33

If you only need to extract image paths, regex is a lightweight way to do it. To keep tabs and newlines out of the match, you can change the regex to `(['"])([^'"\s]+\.(jpg|png|bmp|gif))\1`. The regex already recognises both `'...'` and `"..."`: that is what `['"]` and `\1` are for. – youbl Jun 01 '21 at 09:05
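
The whitespace-excluding pattern from the comment above could be wrapped in a small helper like the one below. This is only a sketch: the names `ImagePathExtractor` and `Extract` are hypothetical, and two details are assumptions that do not appear in the answer, namely `RegexOptions.IgnoreCase` (so that `.PNG` also matches) and the non-capturing group `(?:...)` around the extensions. The path is still read from `Groups[2]`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImagePathExtractor
{
    // Group 1: opening quote (' or "). Group 2: a path with no quotes or
    // whitespace that ends in a known image extension. \1: the same quote again.
    // IgnoreCase and the non-capturing (?:...) group are assumptions added here.
    private static readonly Regex ImagePath = new Regex(
        @"(['""])([^'""\s]+\.(?:jpg|png|bmp|gif))\1",
        RegexOptions.IgnoreCase);

    // Returns every quoted image path found in the given HTML fragment.
    public static List<string> Extract(string html)
    {
        var paths = new List<string>();
        foreach (Match m in ImagePath.Matches(html))
        {
            paths.Add(m.Groups[2].Value);
        }
        return paths;
    }

    public static void Main()
    {
        var html = @"<table style=""background: url('gfx/image.jpg') no-repeat;"">
<table background=""gfx/photo.PNG""><tr><td>";

        foreach (var path in Extract(html))
        {
            Console.WriteLine(path);
        }
    }
}
```

As the comments under the question point out, a pattern like this only finds quoted paths with the listed extensions. It will also pick up paths inside HTML comments and will miss unquoted `url(...)` values, so a real parser remains the more reliable choice when one is available.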
