2

I want to extract the image links from the src attributes of the img tags in some HTML. I have the HTML as a string, which I pass to a method that returns an ArrayList of the image URLs.

The method takes the HTML string and the URL of the web page.

I need help with the regex to get the image name with its extension. Help with matching it against the HTML string would be a bonus. I will accept the right answer or the one closest to it. Thank you all.

I have heard about HTML parsers, but I would rather do it this way, thank you.

Here is my method:

    private ArrayList GetImageLinks(String inputHTML, String link)
    {
        ArrayList imageLinks = new ArrayList();
        // In a verbatim (@"...") string, a double quote is written as "" rather than \"
        var regex = new Regex(@"<img.*?src=[""'](.+?)[""'].*?");

        //using http://gskinner.com/RegExr/ this regex seems to get: <img src="beach.png" for example. while I need just beach.png.

        //match the regex to the html and get all the image links like: image5.png
        //link = inputHTML + link
        //add new link to arraylist

        return imageLinks;
    }
R00059159
  • Parsing HTML with Regex, what could go wrong. – sa_ddam213 Nov 25 '13 at 04:02
  • There is no good reason not to use HtmlAgilityPack for that. If you really want a regular expression, you should write it yourself, since then you will at least have a small chance of understanding it when you see it in your code a month later. – Alexei Levenkov Nov 25 '13 at 04:03
  • Possible duplicate of [Regex to get src value from an img tag](http://stackoverflow.com/questions/1058852/regex-to-get-src-value-from-an-img-tag), which even contains a regular-expression version of the solution. – Alexei Levenkov Nov 25 '13 at 04:04
  • [Do not parse HTML with a regex.](http://stackoverflow.com/a/1732454/2316200) – Pierre-Luc Pineault Nov 25 '13 at 04:05
  • The very fact that you're having trouble getting your Regex correct should be an alarm bell ringing. Libraries that parse markup can account for the horrible structure that some markup contains. Regex, however, cannot. – Simon Whitehead Nov 25 '13 at 04:27
  • You can use WebBrowser to do that. – Thilina H Nov 25 '13 at 11:32

3 Answers

3

I am not sure what you want to do with the image source after extracting it.

Here is how you can extract the image links:

static IEnumerable<String> GetImageLinks(String inputHTML, String someLink)
{
    const string pattern = @"<img\b[^\<\>]+?\bsrc\s*=\s*[""'](?<L>.+?)[""'][^\<\>]*?\>";

    foreach (Match match in Regex.Matches(inputHTML, pattern, RegexOptions.IgnoreCase))
    {
        var imageLink = match.Groups["L"].Value;

        /* Do something with your image link here */

        yield return imageLink;
    }
}
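
The question also mentions combining each image link with the URL of the page. A minimal sketch of one way to do that, assuming the `src` values are either absolute URLs or paths relative to the page address. The class name `ImageLinkDemo` and the method name `GetAbsoluteImageLinks` are hypothetical; the pattern is the one above with the redundant escapes on `<` and `>` dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ImageLinkDemo
{
    // Hypothetical helper: extracts every img src and resolves it against the page URL.
    public static List<string> GetAbsoluteImageLinks(string inputHtml, string pageUrl)
    {
        const string pattern = @"<img\b[^<>]+?\bsrc\s*=\s*[""'](?<L>.+?)[""'][^<>]*?>";
        var baseUri = new Uri(pageUrl);
        var links = new List<string>();

        foreach (Match match in Regex.Matches(inputHtml, pattern, RegexOptions.IgnoreCase))
        {
            // new Uri(baseUri, value) resolves a relative value against the page
            // address and leaves an already absolute value unchanged.
            var absolute = new Uri(baseUri, match.Groups["L"].Value);
            links.Add(absolute.AbsoluteUri);
        }

        return links;
    }
}
```

With a page URL of `http://example.com/gallery/index.html`, a tag such as `<img src="beach.png">` would resolve to `http://example.com/gallery/beach.png`. A malformed `src` value can make the `Uri` constructor throw, so real input may call for `Uri.TryCreate` instead.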
Usman Zafar
1

You can use a WebBrowser control to do that instead of string manipulation:

    private string HtmlUpdateWithImage(string stringHtml)
    {
        // Load the HTML string into an in-memory WebBrowser document
        System.Windows.Forms.WebBrowser browser = new System.Windows.Forms.WebBrowser();
        browser.Navigate("about:blank");
        HtmlDocument doc = browser.Document;
        doc.Write(stringHtml);

        if (null != browser.Document && null != browser.Document.Images && browser.Document.Images.Count > 0)
        {
            // browser.Document.Images holds every img element in the document
            foreach (System.Windows.Forms.HtmlElement item in browser.Document.Images)
            {
                // Read the file path of each image
                string imageFilePath = item.GetAttribute("src");

                // Or set a new value for it
                item.SetAttribute("src", "testPath");
            }
        }
        return "<HTML>" + browser.Document.Body.OuterHtml + "</HTML>";
    }
Thilina H
0

If you just want the file name of the image, use the GetFileName() method of the Path class:

string internetAddress = @"http://hello.com/a/s/s/fff.jpg";
string takeName = Path.GetFileName(internetAddress); // "fff.jpg"
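
One caveat: if the address carries a query string (for example `?size=large`), passing it straight to `Path.GetFileName()` would leave the query in the result. A small sketch of one way around that, assuming the address is an absolute URL; the class name `ImageNameDemo` and the method name `GetImageName` are hypothetical.

```csharp
using System;
using System.IO;

static class ImageNameDemo
{
    // Hypothetical helper: returns the file name part of an absolute image URL.
    public static string GetImageName(string imageUrl)
    {
        // Uri.LocalPath contains only the path portion, without the query string.
        var uri = new Uri(imageUrl);
        return Path.GetFileName(uri.LocalPath);
    }
}
```

For `http://hello.com/a/s/s/fff.jpg?size=large` this returns `fff.jpg`.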
StepUp