I have HTML files and need to extract all the links and images from them. So basically I need:

<a href="this_is_what_I_need"> and <img src="this_is_also_needed">

I read the files line by line and can extract a value, but only the first one on each line:

    List<string> links = new List<string>();
    if (line.Contains(@"<a href=""") || line.Contains(@"<img src="""))
    {
        if (line.Contains(@"<a href="""))
        {
            // Take the text between <a href=" and the next double quote
            links.Add(line.Split(new string[] { @"<a href=""" }, StringSplitOptions.None)[1].Split('"')[0]);
        }
        else
        {
            // Take the text between <img src=" and the next double quote
            links.Add(line.Split(new string[] { @"<img src=""" }, StringSplitOptions.None)[1].Split('"')[0]);
        }
    }

But a line might contain multiple links and/or images. How can I get all of them?

fishmong3r
  • Use http://htmlagilitypack.codeplex.com/. – brz Oct 03 '14 at 09:10
  • Please use a tool like the HTML Agility Pack instead - just search for all "a" or "img" elements, and fetch the "@href" attribute. – Marc Gravell Oct 03 '14 at 09:10
  • Something like [HTML Agility Pack](http://htmlagilitypack.codeplex.com/) will be better than direct string manipulation for this kind of task. – LukeH Oct 03 '14 at 09:11
  • First of all, it would be nice to know what kind of variable line is. Second, it seems to me that you need a while loop that runs until the end of the file is reached: while (!EOF) – Skaros Ilias Oct 03 '14 at 09:10

1 Answer

I don't think you are using the right approach for this. I suggest taking a look at a scraping tool like HtmlAgilityPack, which is designed for this kind of task.

Here is an example for <a href="">, which you can adapt for <img src="">:

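    HtmlDocument doc = new HtmlDocument();
    doc.Load("mytest.htm");

    // SelectNodes returns null when nothing matches, so check before iterating.
    // This XPath only matches links with class 'dn-index-link'; "//a[@href]" matches every link.
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@class='dn-index-link']");
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine("node:" + node.GetAttributeValue("href", null));
        }
    }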
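
To cover both cases from the question in one pass, the same idea can be sketched as below. This is only a rough sketch: `LinkExtractor`, `Extract` and `Collect` are hypothetical names chosen for illustration, and it assumes the HtmlAgilityPack package is referenced. It parses the whole document at once instead of going line by line, so several links or images on one line are not a problem.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns every href of <a> followed by every src of <img>.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();
        Collect(doc, "//a[@href]", "href", results);
        Collect(doc, "//img[@src]", "src", results);
        return results;
    }

    private static void Collect(HtmlDocument doc, string xpath, string attribute, List<string> results)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            results.Add(node.GetAttributeValue(attribute, string.Empty));
        }
    }
}
```

You could call it as `LinkExtractor.Extract(File.ReadAllText("mytest.htm"))`. Note that the list holds all link targets first and all image sources after them, not the order in which they appear in the file.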
carla
BRAHIM Kamel