RegEx: extract all text from link

Question

I need to extract ALL the text from a link like this one:

<Aid="ctl00_ctl00_ctl00_BodyContent_ContentPlaceHolder1_MainContentPlaceHolder_ResourceHostControl1_resContainer_rptColumn1_ctl00_ctl00_wrapper_downNodesTable_ctl01_ToolsetLink1"href="/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:78">SFTP</A>

The A and the id run together because I removed all \t, \r, \n and spaces.

The expression I tried:

\<a.+?>([^\<]+)

Basically, I want to extract the word SFTP, which I guess can be described as:

starts after >, contains any possible character (including +, -, dots and commas), and ends with </a>

After trying Expresso and browsing the values, I came to this:

>(\w+)\</a> - I get two values: [0] - >SFTP</A> and [1] - SFTP

It only works for a word without any special characters.

My problem is that I don't know what can be between the > and the <.

I tried adding a . before the \w+ as the "any character",

still with no success.

Read this classic question and its answers: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — L.B, Apr 05 '14 at 18:25
`string text = Regex.Match(str, "(?<=<a[^>]*>).*?(?=</a>)", RegexOptions.IgnoreCase).Value` — Ulugbek Umirov, Apr 05 '14 at 18:32
Works GREAT!!! Thank you. Is there a chance you could explain what it means? — Light_User, Apr 05 '14 at 19:12

score 1 · Answer 1 · answered Apr 05 '14 at 19:20

Yes, parsing HTML with a regex is a bad idea, but if you still want to do it:

string text = Regex.Match(html, "(?<=<a[^>]*>).*?(?=</a>)", RegexOptions.IgnoreCase).Value;

We want to extract the text between the <a...> and </a> tags, so we use a positive lookbehind for the <a...> tag and a positive lookahead for the </a> tag. The text itself is matched by .*?, which is lazy, so it stops at the first </a>. How do we match the <a...> tag? The ... can be anything except >, so we use [^>]*, which gives <a[^>]*>. We wrap that in a lookbehind, (?<=<a[^>]*>), and wrap the </a> tag in a lookahead, (?=</a>). Combining the three parts gives the full expression. RegexOptions.IgnoreCase lets it match <A> and </A> as well.
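To make the explanation above concrete, here is a minimal sketch that runs the same pattern over several links and compares it with a capture-group variant. The class name `LinkTextDemo`, the method names, the shortened id and the second sample link are made up for illustration. `RegexOptions.Singleline` is my own addition, on the assumption that link text might span more than one line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkTextDemo
{
    // Lookaround version from the answer: the match itself is only the link text.
    public static List<string> WithLookarounds(string html)
    {
        var result = new List<string>();
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        foreach (Match m in Regex.Matches(html, "(?<=<a[^>]*>).*?(?=</a>)", options))
        {
            result.Add(m.Value);
        }
        return result;
    }

    // Capture-group variant: the tags are part of the match, the text is group 1.
    public static List<string> WithGroup(string html)
    {
        var result = new List<string>();
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        foreach (Match m in Regex.Matches(html, "<a[^>]*>(.*?)</a>", options))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Space-stripped link as in the question (id shortened), plus a
        // hypothetical second link whose text contains punctuation.
        string html =
            "<Aid=\"ToolsetLink1\"href=\"/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:78\">SFTP</A>" +
            "<a href=\"#\">web-01.example.com, port 22 (+backup)</a>";

        foreach (string text in WithLookarounds(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

Both methods should return the same list here. Note that `<a[^>]*>` also matches other tags whose names start with "a" (for example `<abbr>`), which is one of the reasons a real HTML parser is the safer choice.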


Andrew Morton · Answer 2 · 2014-04-05T19:33:05.927

0

If you didn't remove the spaces, you could use an XmlTextReader to avoid the problems with trying to parse XML with a regex:

using System;
using System.Text;
using System.Xml;

namespace ConsoleApplication1
{
    class Program
    {

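        // Returns the first text node found in the XML fragment, or "" if there is none.
        static string GetText(string xmlFragment)
        {
            // Parse the string as an element fragment; "using" disposes the reader afterwards.
            using (XmlTextReader tr = new XmlTextReader(xmlFragment, XmlNodeType.Element, null))
            {
                while (tr.Read())
                {
                    if (tr.NodeType == XmlNodeType.Text)
                    {
                        return tr.Value;
                    }
                }
            }

            return "";
        }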

        static void Main(string[] args)
        {
            string s = "<A id=\"ctl00_ctl00_ctl00_BodyContent_ContentPlaceHolder1_MainContentPlaceHolder_ResourceHostControl1_resContainer_rptColumn1_ctl00_ctl00_wrapper_downNodesTable_ctl01_ToolsetLink1\" href=\"/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:78\">SFTP</A>";
            Console.WriteLine(GetText(s)); // outputs "SFTP"
            Console.ReadLine();
        }
    }
}
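The "if you didn't remove the spaces" condition matters because markup such as `<Aid="..."href="...">` is no longer well-formed XML: there is no whitespace between the element name and its attributes, so the reader throws while parsing. The following is a rough sketch of guarding against that. `XmlFragmentDemo` and `TryGetText` are hypothetical names, and the id is shortened for readability.

```csharp
using System;
using System.Xml;

public static class XmlFragmentDemo
{
    // Hypothetical helper: returns false instead of throwing when the fragment
    // is not well-formed XML or contains no text node.
    public static bool TryGetText(string xmlFragment, out string text)
    {
        text = "";
        try
        {
            using (var reader = new XmlTextReader(xmlFragment, XmlNodeType.Element, null))
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Text)
                    {
                        text = reader.Value;
                        return true;
                    }
                }
            }
        }
        catch (XmlException)
        {
            // Malformed fragment, e.g. attributes glued to the element name.
        }
        return false;
    }

    public static void Main()
    {
        string text;

        // Well-formed: spaces between the element name and the attributes are intact.
        bool ok = TryGetText("<A id=\"ToolsetLink1\" href=\"/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:78\">SFTP</A>", out text);
        Console.WriteLine(ok + " " + text);

        // Space-stripped, as in the question: not well-formed XML.
        bool stripped = TryGetText("<Aid=\"ToolsetLink1\"href=\"/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:78\">SFTP</A>", out text);
        Console.WriteLine(stripped);
    }
}
```

If the input has already been stripped of whitespace like this, the regex from the other answer or an HTML parser that tolerates broken markup is probably a better fit than an XML reader.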

edited Apr 05 '14 at 19:33

answered Apr 05 '14 at 19:24


I'm not sure I can, because if I'm "converting" to XML, the XML has to be valid, with all the opening and closing tags, doesn't it? it does have a .. so I trimmed the string between two constant divs and extract the links from there – Light_User Apr 05 '14 at 20:44
Yes, it would need valid XML. If you are already using string operations on the markup, you may end up with invalid parts. I suggest you keep trying with the HTMLAgilitypack route. – Andrew Morton Apr 05 '14 at 20:51
The source I get from the page is JavaScript that eventually writes the "generated source" I want to parse, so I use 'string _Doc = web_Browser1.DocumentText' to get it. Will it be valid? – Light_User Apr 05 '14 at 21:16
Well, as long as you take into account the remarks in [WebBrowser.DocumentText Property](http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.documenttext%28v=vs.110%29.aspx) then you should get (at least mostly) valid markup. The HTMLAgilitypack should be fairly tolerant of markup errors. – Andrew Morton Apr 05 '14 at 21:22
