I have an XML file containing some links:

<SupportingDocs>
<LinkedFile>http://llcorp/ll/lljomet.dll/open/864606</LinkedFile>
<LinkedFile>http://llcorp/ll/lljomet.dll/open/1860632</LinkedFile>
<LinkedFile>%20http%3A%2F%2Fllenglish%2Fll%2Fll.exe%2Fopen%2F927515</LinkedFile>
<LinkedFile>%20http%3A%2F%2Fllenglish%2Fll%2Fll.exe%2Fopen%2F973783</LinkedFile>
</SupportingDocs>

I am using the regex "\<[^\<>]+>(?:https?://|www.)[^\<>]+\</[^\<>]+>" in C#, called as var matches = MyParser.Matches(FormXml); It matches the first two links but not the encoded ones.

How can I match URL-encoded links using a regex?

Jom
    You're matching two slashes after the https. Those are present in the first two links but not in the second two, where they are percent-encoded (%3A%2F%2F). There might be other issues, but that's the first I saw. – BurnsBA Aug 03 '17 at 15:40
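
The effect described in this comment can be reproduced with a minimal sketch. It runs only the URL part of the pattern against one plain and one encoded value from the question. The variable names and the `^`/`$` anchors are assumptions added for illustration, and the dot in `www.` is escaped here, which the original pattern does not do.

```csharp
using System;
using System.Text.RegularExpressions;

// URL part of the pattern from the question (tag wrappers omitted, anchors added).
var urlPattern = new Regex(@"^(?:https?://|www\.)[^<>]+$");

string plain = "http://llcorp/ll/lljomet.dll/open/864606";
string encoded = "%20http%3A%2F%2Fllenglish%2Fll%2Fll.exe%2Fopen%2F927515";

// The encoded value contains "%3A%2F%2F" rather than "://", so it cannot match.
Console.WriteLine(urlPattern.IsMatch(plain));    // True
Console.WriteLine(urlPattern.IsMatch(encoded));  // False

// Decoding first restores "://"; "%20" becomes a leading space, hence Trim().
string decoded = Uri.UnescapeDataString(encoded).Trim();
Console.WriteLine(decoded);                      // http://llenglish/ll/ll.exe/open/927515
Console.WriteLine(urlPattern.IsMatch(decoded));  // True
```

So the pattern itself is not at fault for the plain links; the encoded values simply never contain the literal text it looks for until they are decoded.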

1 Answer

Here's a snippet that might be helpful. I really question whether or not you're using the best approach, so I made some assumptions (perhaps you just haven't given enough details).

I parsed the XML into an XmlDocument to work with it in code. The relevant elements ("LinkedFile") are pulled out, and the text of each one is parsed as a Uri. If that fails, the text is unescaped and the parse is attempted again. The result is a list of strings containing the URLs that parsed correctly. If you really need to, you can run your regex on this collection.

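// this is for the interactive console
#r "System.Xml.Linq"
// Uri and List<T> live in these namespaces (harmless if already imported by default)
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;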

// sample data, as provided in the post.
string rawXml = "<SupportingDocs><LinkedFile>http://llcorp/ll/lljomet.dll/open/864606</LinkedFile><LinkedFile>http://llcorp/ll/lljomet.dll/open/1860632</LinkedFile><LinkedFile>%20http%3A%2F%2Fllenglish%2Fll%2Fll.exe%2Fopen%2F927515</LinkedFile><LinkedFile>%20http%3A%2F%2Fllenglish%2Fll%2Fll.exe%2Fopen%2F973783</LinkedFile></SupportingDocs>";
var xdoc = new XmlDocument();
xdoc.LoadXml(rawXml);

// will store urls that parse correctly
var foundUrls = new List<string>();

// temp object used to parse urls
Uri uriResult;

foreach (XmlElement node in xdoc.GetElementsByTagName("LinkedFile"))
{
    var text = node.InnerText;

    // first parse attempt
    var result = Uri.TryCreate(text, UriKind.Absolute, out uriResult);

    // any valid Uri will parse here, so limit to http and https protocols
    // see https://stackoverflow.com/a/7581824/1462295
    if (result && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    {
        foundUrls.Add(uriResult.ToString());
    }
    else
    {
        // The above didn't parse, so check if this is an encoded string.
        // There might be leading/trailing whitespace, so fix that too
        result = Uri.TryCreate(Uri.UnescapeDataString(text).Trim(), UriKind.Absolute, out uriResult);

        // see comments above
        if (result && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
        {
            foundUrls.Add(uriResult.ToString());
        }
    }
}

// interactive output:
> foundUrls
List<string>(4) { "http://llcorp/ll/lljomet.dll/open/864606", "http://llcorp/ll/lljomet.dll/open/1860632", "http://llenglish/ll/ll.exe/open/927515", "http://llenglish/ll/ll.exe/open/973783" }
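
If the regex has to run over the raw XML rather than over parsed elements, one option is to let the pattern accept both the literal and the percent-encoded scheme separator, and then decode each capture. The following is only a sketch under assumptions: the pattern, the `ExtractUrls` helper name, and the handling of a leading `%20` are hypothetical and not part of the code above, and it only covers the two encodings that appear in the sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: finds http(s) links in element text, plain or percent-encoded.
static List<string> ExtractUrls(string xml)
{
    // Group 1 captures the element text. The scheme separator may be "://" or
    // "%3A%2F%2F"; leading whitespace or "%20" before the scheme is tolerated.
    var pattern = new Regex(
        @"<[^<>]+>\s*((?:%20)*https?(?:://|%3A%2F%2F)[^<>]+)</[^<>]+>",
        RegexOptions.IgnoreCase);

    var urls = new List<string>();
    foreach (Match m in pattern.Matches(xml))
    {
        string raw = m.Groups[1].Value;

        // Only decode values whose separator was encoded, so escapes inside
        // an already-plain URL are left alone.
        string candidate = raw.Contains("://") ? raw : Uri.UnescapeDataString(raw);
        candidate = candidate.Trim();

        // Keep only values that parse as absolute http/https URIs.
        if (Uri.TryCreate(candidate, UriKind.Absolute, out Uri uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
        {
            urls.Add(uri.ToString());
        }
    }
    return urls;
}

// Sample data from the question.
string sample =
    "<SupportingDocs>" +
    "<LinkedFile>http://llcorp/ll/lljomet.dll/open/864606</LinkedFile>" +
    "<LinkedFile>http://llcorp/ll/lljomet.dll/open/1860632</LinkedFile>" +
    "<LinkedFile>%20http%3A%2F%2Fllenglish%2Fll%2Fll.exe%2Fopen%2F927515</LinkedFile>" +
    "<LinkedFile>%20http%3A%2F%2Fllenglish%2Fll%2Fll.exe%2Fopen%2F973783</LinkedFile>" +
    "</SupportingDocs>";

List<string> found = ExtractUrls(sample);
foreach (string url in found)
{
    Console.WriteLine(url);
}
```

On the sample data this should yield the same four URLs as the XmlDocument version. Parsing the XML first is still the more robust route, since a regex over raw markup will not handle things like CDATA sections or entity references.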
BurnsBA
    The XML file contains many types of URLs in different parts. The code actually takes every kind of matching URL and then processes each type. But your answer gave me some options to think about. Thanks – Jom Aug 08 '17 at 18:21