0

I need to extract all of the text from a link like this one:

<Aid="ctl00_ctl00_ctl00_BodyContent_ContentPlaceHolder1_MainContentPlaceHolder_ResourceHostControl1_resContainer_rptColumn1_ctl00_ctl00_wrapper_downNodesTable_ctl01_ToolsetLink1"href="/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:78">SFTP</A>

The A and the id run together because I removed all \t, \r, \n and space characters.

The expression I tried:

\<a.+?>([^\<]+) 

Basically, I want to extract the word SFTP, which I think can be described as:

starts with >, followed by any possible characters (including +, -, dots and commas), and ends with </a>.

After experimenting in Expresso and browsing the matched values, I came to this:

>(\w+)\</a> - I get two values: [0] is >SFTP</A> and [1] is SFTP

It works only for a word without any special characters.

My problem is that I don't know what can appear between the > and the <.

I tried adding a . before the \w+ as the "any character" match, but still with no success.

Light_User

2 Answers

1

Yes, it's bad practice to use a regex to parse HTML, but if you still want to do it:

string text = Regex.Match(html, "(?<=<a[^>]*>).*?(?=</a>)", RegexOptions.IgnoreCase).Value;

We want to extract the text between the <a...> and </a> tags, so we use a positive lookbehind for the <a...> tag and a positive lookahead for the </a> tag. The text itself is matched by .*?. How do we match the <a...> tag? The ... can be anything except >, so we use [^>]*, which gives <a[^>]*>. We then wrap that in a lookbehind, (?<=<a[^>]*>), and wrap the </a> tag in a lookahead, (?=</a>). Combining the three pieces gives the full expression.
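Since the question asks for the text of all links, here is a minimal sketch of the same idea using a capture group and Regex.Matches instead of lookarounds. The class name LinkText, the method ExtractAll and the sample markup are made up for illustration. It keeps the <a[^>]*> pattern from above, so it also accepts the space-stripped <Aid=...> form (and, as a side effect, would match tags such as <abbr>). RegexOptions.Singleline is added on the assumption that link text may contain line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LinkText
{
    // Group 1 captures everything between the opening <a...> tag and </a>.
    // Singleline lets "." match newlines; IgnoreCase accepts <A> and </A>.
    private static readonly Regex Anchor = new Regex(
        @"<a[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns the inner text of every link found.
    public static List<string> ExtractAll(string html)
    {
        var result = new List<string>();
        foreach (Match m in Anchor.Matches(html))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}

class Demo
{
    static void Main()
    {
        // Sample input: one space-stripped link and one normal link.
        string html = "<Aid=\"x\"href=\"/a\">SFTP</A>" +
                      "<a href=\"/b\">web-01, v1.2 (+backup)</a>";

        foreach (string text in LinkText.ExtractAll(html))
        {
            Console.WriteLine(text);
        }
        // prints:
        // SFTP
        // web-01, v1.2 (+backup)
    }
}
```

Because the group is .*? rather than \w+, characters such as +, -, dots and commas are captured as well. Nested markup inside the link would be returned as-is, which is one of the reasons an HTML parser is the safer route.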

Ulugbek Umirov
0

If you hadn't removed the spaces, you could use an XmlTextReader and avoid the problems that come with trying to parse XML with a regex:

using System;
using System.Text;
using System.Xml;

namespace ConsoleApplication1
{
    class Program
    {

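        // Returns the first text node found in the fragment, or "" if there is none.
        static string GetText(string xmlFragment)
        {
            // Parse the string as an element fragment rather than a full document.
            using (XmlTextReader tr = new XmlTextReader(xmlFragment, XmlNodeType.Element, null))
            {
                while (tr.Read())
                {
                    if (tr.NodeType == XmlNodeType.Text)
                    {
                        return tr.Value;
                    }
                }
            }

            return "";
        }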

        static void Main(string[] args)
        {
            string s = "<A id=\"ctl00_ctl00_ctl00_BodyContent_ContentPlaceHolder1_MainContentPlaceHolder_ResourceHostControl1_resContainer_rptColumn1_ctl00_ctl00_wrapper_downNodesTable_ctl01_ToolsetLink1\" href=\"/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:78\">SFTP</A>";
            Console.WriteLine(GetText(s)); // outputs "SFTP"
            Console.ReadLine();
        }
    }
}
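For more than one link, the same approach could be extended roughly as follows. This is only a sketch: the class XmlLinkText and the method TryGetLinkTexts are hypothetical names, and it swaps in XmlReader.Create with ConformanceLevel.Fragment in place of the XmlTextReader constructor. It returns null when the markup is not well-formed XML, which is what happens with the space-stripped <Aid=...> form from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

static class XmlLinkText
{
    // Hypothetical helper: returns the text of every <a>/<A> element,
    // or null if the fragment is not well-formed XML.
    public static List<string> TryGetLinkTexts(string xmlFragment)
    {
        // Fragment conformance allows several top-level elements.
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var texts = new List<string>();
        StringBuilder current = null; // non-null while inside a link

        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xmlFragment), settings))
            {
                while (reader.Read())
                {
                    bool isAnchor = string.Equals(reader.Name, "a", StringComparison.OrdinalIgnoreCase);
                    if (reader.NodeType == XmlNodeType.Element && isAnchor && !reader.IsEmptyElement)
                    {
                        current = new StringBuilder();
                    }
                    else if (reader.NodeType == XmlNodeType.Text && current != null)
                    {
                        current.Append(reader.Value);
                    }
                    else if (reader.NodeType == XmlNodeType.EndElement && isAnchor && current != null)
                    {
                        texts.Add(current.ToString());
                        current = null;
                    }
                }
            }
        }
        catch (XmlException)
        {
            return null; // e.g. <Aid="..."href="..."> is not valid XML
        }

        return texts;
    }
}

class XmlDemo
{
    static void Main()
    {
        string ok = "<A id=\"x\" href=\"/a\">SFTP</A><a href=\"/b\">web-01 &amp; backup</a>";
        foreach (string text in XmlLinkText.TryGetLinkTexts(ok))
        {
            Console.WriteLine(text);
        }

        string stripped = "<Aid=\"x\"href=\"/a\">SFTP</A>";
        Console.WriteLine(XmlLinkText.TryGetLinkTexts(stripped) == null); // True
    }
}
```

Anything that is legal HTML but not XML (unquoted attributes, a bare & in an href, entities such as &nbsp;) would also end up in the null branch, so this only suits markup you know to be well-formed.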
Andrew Morton
  • I'm not sure I can, because if I'm "converting" to XML, the XML has to be valid, with all the opening and closing tags, doesn't it? it does have a
    .. so I trimmed the string between two constant divs and extract the links from there
    – Light_User Apr 05 '14 at 20:44
  • Yes, it would need valid XML. If you are already using string operations on the markup, you may end up with invalid parts. I suggest you keep trying with the HtmlAgilityPack route. – Andrew Morton Apr 05 '14 at 20:51
  • The source I get from the page is JavaScript that eventually writes the "generated source" that I want to parse, so I use 'string _Doc = web_Browser1.DocumentText' to get it. Will it be valid? – Light_User Apr 05 '14 at 21:16
  • Well, as long as you take into account the remarks in [WebBrowser.DocumentText Property](http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.documenttext%28v=vs.110%29.aspx) then you should get (at least mostly) valid markup. The HtmlAgilityPack should be fairly tolerant of markup errors. – Andrew Morton Apr 05 '14 at 21:22