I'm trying to use a regex to extract all image sources from an HTML string. For a couple of reasons I cannot use HTML Agility Pack.

I need to extract 'gfx/image.jpg' from strings that look like this:

<table cellpadding="0" cellspacing="0"  border="0" style="height:350px; margin:0; background: url('gfx/image.jpg') no-repeat;">
<table cellpadding="0" cellspacing="0" border="0" background="gfx/image.jpg" style=" width:700px; height:250px; "><tr><td valign="middle">
user13657
  • Regex isn't the right tool for parsing HTML files. What if there are special characters like `<` or `&` in the file name, or the file name appears in a comment? – phuclv Jun 01 '21 at 08:14
  • Is HTML Agility Pack the best solution then? Is there any universal solution for getting image links out there? – user13657 Jun 01 '21 at 08:17
  • I don't know what HTML Agility Pack is but the only reliable way to parse HTML files is to use a parser. [Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms](https://stackoverflow.com/q/6751105/995714), [Using regular expressions to parse HTML: why not?](https://stackoverflow.com/q/590747/995714), [you can't parse \[x\]html with regex](https://stackoverflow.com/a/1732454/995714) – phuclv Jun 01 '21 at 08:22
  • @phuclv [HTML Agility Pack](https://html-agility-pack.net/) is an HTML parser. – Andrew Morton Jun 01 '21 at 08:48
  • Html Agility Pack is relatively simple to use, so it's usually recommended. If you just need the links to the images, you can also use a [WebBrowser class](https://learn.microsoft.com/en-us/dotnet/api/system.windows.forms.webbrowser) (not Control) to navigate to a URL (remote or local). When the Document is loaded, you have all the images already parsed in the `[WebBrowser].Document.Images` collection. You can then download the images later or get the already downloaded files from the browser cache. – Jimi Jun 01 '21 at 09:07

1 Answer

You can use this regex: `(['"])([^'"]+\.jpg)\1` and then read `Groups[2]` from each match. This code works for the two sample strings:

using System;
using System.Text.RegularExpressions;

var str = @"<table cellpadding=""0"" cellspacing=""0""  border=""0"" style=""height:350px; margin:0; background: url('gfx/image.jpg') no-repeat;"">
<table cellpadding=""0"" cellspacing=""0"" border=""0"" background=""gfx/image.jpg"" style="" width:700px; height:250px; ""><tr><td valign=""middle"">";
// Group 1 is the opening quote, group 2 is the path; \1 requires the same closing quote.
var regex = new Regex(@"(['""])([^'""]+\.jpg)\1");
var match = regex.Match(str);
// Walk through every match and print the captured path.
while (match.Success)
{
    Console.WriteLine(match.Groups[2].Value);
    match = match.NextMatch();
}
youbl
  • If you need all image types, the regex can be changed to: `(['"])([^'"]+\.(jpg|png|bmp|gif))\1` – youbl Jun 01 '21 at 08:33
  • If you only need to extract image paths, a regex is a lightweight way to do it. To keep tabs and newlines out of the match, you can change the regex to `(['"\s]` excluded) `(['"])([^'"\s]+\.(jpg|png|bmp|gif))\1`. It already recognises both `'` and `"` quoting: that is what `['"]` and `\1` are for. – youbl Jun 01 '21 at 09:05
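For reference, the regex from the last two comments can be wrapped up roughly like this. It is only a sketch: `ExtractImagePaths` is a hypothetical helper name, and the non-capturing group `(?:...)` and `RegexOptions.IgnoreCase` are additions that are not part of the original answer. Like any regex approach, it will also pick up paths inside HTML comments and will miss unquoted attribute values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the answer): returns every quoted path
// that ends in one of the listed image extensions.
static List<string> ExtractImagePaths(string html)
{
    // Group 1: opening quote. Group 2: the path. \1: the same quote must close it.
    // [^'"\s] stops the path at any quote or whitespace character.
    // IgnoreCase is an assumption so that ".PNG" and ".Jpg" also match.
    var regex = new Regex(
        @"(['""])([^'""\s]+\.(?:jpg|png|bmp|gif))\1",
        RegexOptions.IgnoreCase);

    var paths = new List<string>();
    foreach (Match match in regex.Matches(html))
    {
        paths.Add(match.Groups[2].Value);
    }
    return paths;
}

// Made-up sample input, shortened from the strings in the question.
var html = @"<table style=""background: url('gfx/image.jpg') no-repeat;"">
<table background=""gfx/photo.PNG"" style="" width:700px; ""><tr><td valign=""middle"">";

foreach (var path in ExtractImagePaths(html))
{
    Console.WriteLine(path);
}
```

`Regex.Matches` is used here instead of the `Match`/`NextMatch` loop only for brevity; both visit the same matches in the same order.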