Get multiple substrings from string

Question

I have HTML files and need to extract all the links and images from them. Basically, I need:

<a href="this_is_what_I_need"> and <img src="this_is_also_needed">

I read the files line by line and can extract a match, but only the first one on each line:

    List<string> links = new List<string>();
    if (line.Contains(@"<a href=""") || line.Contains(@"<img src="""))
    {
        if (line.Contains(@"<a href="""))
        {
            // Take the text between <a href=" and the next quote (first match only).
            links.Add(line.Split(new string[] { @"<a href=""" }, StringSplitOptions.None)[1].Split('"')[0]);
        }
        else
        {
            // Same for <img src=" (first match only).
            links.Add(line.Split(new string[] { @"<img src=""" }, StringSplitOptions.None)[1].Split('"')[0]);
        }
    }

But a line might contain multiple links and/or images. How can I get all of them?

Please use a tool like the HTML Agility Pack instead - just search for all "a" or "img" elements, and fetch the "@href" attribute. – Marc Gravell, Oct 03 '14 at 09:10
Something like [HTML Agility Pack](http://htmlagilitypack.codeplex.com/) will be better than direct string manipulation for this kind of task. – LukeH, Oct 03 '14 at 09:11
First of all, it would be nice to know what kind of variable `line` is. Second of all, it seems to me that you need a while loop that runs until the end of the file is reached: `while (!EOF)`. – Skaros Ilias, Oct 03 '14 at 09:10

score 5 · Accepted Answer · edited Nov 27 '17 at 17:25

I don't think you are using the right approach here. I suggest taking a look at an HTML parsing and scraping library such as HtmlAgilityPack, which is designed for this kind of task.

Here is an example for <a href="...">; you can adapt it for <img src="...">:

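    using System;
    using HtmlAgilityPack;

    // Load the HTML file into a DOM.
    HtmlDocument doc = new HtmlDocument();
    doc.Load("mytest.htm");

    // Select <a> elements with class "dn-index-link"; SelectNodes returns null if none match.
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a[@class='dn-index-link']"))
    {
        Console.WriteLine("node:" + node.GetAttributeValue("href", null));
    }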
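
To adapt this to both links and images, one option is to walk the element tree instead of using XPath. The following is only a sketch, written as a top-level program (C# 9 or later) and assuming the HtmlAgilityPack NuGet package is referenced. `ExtractLinks` and the sample HTML string are hypothetical names made up for illustration; they are not part of the library or of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper: collects href values from <a> and src values from <img>.
static List<string> ExtractLinks(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Descendants(name) yields an empty sequence when nothing matches.
    var hrefs = doc.DocumentNode.Descendants("a")
        .Select(n => n.GetAttributeValue("href", ""));
    var srcs = doc.DocumentNode.Descendants("img")
        .Select(n => n.GetAttributeValue("src", ""));

    // Skip elements that lack the attribute (they produce the "" default).
    return hrefs.Concat(srcs).Where(v => v.Length > 0).ToList();
}

// Made-up sample: two links and one image on a single line.
string sample = "<p><a href=\"a.html\">A</a> <img src=\"pic.png\"> <a href=\"b.html\">B</a></p>";
foreach (string link in ExtractLinks(sample))
{
    Console.WriteLine(link);
}
```

Because the whole document is parsed at once, it no longer matters how many links or images share a line. Note that this sketch lists all href values first and then all src values, not in document order.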

edited Nov 27 '17 at 17:25 by carla
answered Oct 03 '14 at 09:10 by BRAHIM Kamel

Does this work for PHP files as well? – fishmong3r Oct 03 '14 at 09:13
@fishmong3r No, this is for C#/.NET. PHP has its own tools, extensions, and classes to handle this. – user3036342 Oct 03 '14 at 09:15
@fishmong3r if you mean the *input* is php; you'll have to try it to see what HTML Agility Pack does with it – Marc Gravell Oct 03 '14 at 09:22
Yeah, of course I mean the input is php. – fishmong3r Oct 03 '14 at 09:24
If there is no a href in a file, the program dies. How can I check for that first, like `if(there is no node) return;`? – fishmong3r Oct 03 '14 at 10:27
Never mind, I solved it with a try-catch. Thanks a lot for your help. – fishmong3r Oct 03 '14 at 12:23
@fishmong3r Alternatively, you could pull the node selection out of the foreach loop: `var nodes = doc.DocumentNode.SelectNodes("//a[@class='dn-index-link']");` then check `nodes` for null and return if necessary. – Bret Oct 03 '14 at 16:47
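
The null check suggested in the last comment might look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package and C# 9 top-level statements. `GetHrefs` is a hypothetical helper name, and the XPath `//a[@href]` is a swapped-in example, not the class-based query from the answer. By default, `SelectNodes` returns null instead of an empty collection when nothing matches, which is why the unguarded foreach crashes.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper: returns the href of every <a> element,
// or an empty list when the document has none.
static List<string> GetHrefs(HtmlDocument document)
{
    var result = new List<string>();
    HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//a[@href]");

    // Guard against the null that SelectNodes returns when there is no match.
    if (nodes == null)
    {
        return result;
    }

    foreach (HtmlNode node in nodes)
    {
        result.Add(node.GetAttributeValue("href", ""));
    }
    return result;
}

// A document without any links: the guard avoids the exception.
HtmlDocument empty = new HtmlDocument();
empty.LoadHtml("<p>no links here</p>");
Console.WriteLine(GetHrefs(empty).Count);
```

Compared with wrapping the loop in try-catch, an explicit check keeps unrelated exceptions (for example, a missing file) from being swallowed.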
