I have written a function that replaces the value of href with some value + the original href value.

For example:

<a href="/somepage.htm" id="test">

replace with

<a href="http://www.stackoverflow.com/somepage.htm" id="test">

Cases where no replacement is needed:

<a href="http://www.stackoverflow.com/somepage.htm" id="test">
<a href="#" id="test">
<a href="javascript:alert('test');" id="test">
<a href="" id="test">

I have written the following method. It works for all of these cases except the blank href value:

public static string RelativeToAbsoluteURLS(string text, string absoluteUrl, string pattern = "src|href")
    {
        if (String.IsNullOrEmpty(text))
        {
            return text;
        }
        String value = Regex.Replace(text, "<(.*?)(" + pattern + ")=\"(?!http|javascript|#)(.*?)\"(.*?)>", "<$1$2=\"" + absoluteUrl + "$3\"$4>", RegexOptions.IgnoreCase | RegexOptions.Multiline);

        return value.Replace(absoluteUrl + "/", absoluteUrl);
    }

I wrote (?!http|javascript|#) to skip values that start with http, javascript, or #, and it works for those cases. To handle the empty value as well, I took this part

(?!http|javascript|#)(.*?)

and replaced the * with +

(?!http|javascript|#)(.+?)

but it still does not work for the empty href="" case.

John Saunders
vikas
  • question is big enough, I need it to work with `""` case – vikas Mar 11 '14 at 18:46
  • See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – John Saunders Mar 11 '14 at 18:51
  • obligatory reference: [You can't parse HTML with RegEx](http://stackoverflow.com/a/1732454/1159478) – Servy Mar 11 '14 at 18:52
  • thanks for your comment, I have written a separate class for HTML parsing, and this method is part of utility class, which has certain methods for certain special scenario and this function is one of them. – vikas Mar 11 '14 at 18:56
  • here I am not expecting that this method should work with tags, the only expectation is it should work with `""` And also I am not expecting that it should work for all the cases[as regular expression is not the best way], at least for those I want – vikas Mar 11 '14 at 18:57
  • I tried `(?!http|javascript|#)(.*?){2,}` also – vikas Mar 11 '14 at 19:02

2 Answers

Changing * to + does not work, because you got it completely wrong:

  • * means "zero or more"
  • + means "one or more"

So with + you are forcing some content to be present, rather than allowing the content to be missing.

Another thing you got wrong is the placement. The * at that place applies to the preceding "." character. Together, they mean "zero or more characters". So this part already does not require any content. Therefore, since your regex reportedly does not work with empty content, something else must be requiring it.

Looking at the preceding expressions:

(?!http|javascript|#)(.*?)

The ?! is a zero-width negative lookahead. Zero-width. Negative. That means that it will not require any content either.
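To see the difference concretely, here is a minimal sketch (the class name QuantifierDemo and the helper Describe are made up for illustration). It runs both quantifiers against the empty-href tag. One thing it brings out: the . also matches a double quote, so with + the group does not fail on href="", it simply swallows the closing quote and runs on to the next one.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Hypothetical helper: reports whether the pattern matched and what group 1 captured.
    public static string Describe(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? "match, group 1 = [" + m.Groups[1].Value + "]" : "no match";
    }

    static void Main()
    {
        const string tag = "<a href=\"\" id=\"test\">";

        // '*' allows zero characters, so the empty value is captured as an empty group.
        Console.WriteLine(Describe(tag, "href=\"(?!http|javascript|#)(.*?)\""));
        // prints: match, group 1 = []

        // '+' needs at least one character; '.' takes the closing quote and keeps going.
        Console.WriteLine(Describe(tag, "href=\"(?!http|javascript|#)(.+?)\""));
        // prints: match, group 1 = [" id=]

        // With nothing after the closing quote, '+' has nothing to consume and fails.
        Console.WriteLine(Describe("href=\"\"", "href=\"(.+?)\""));
        // prints: no match
    }
}
```

In the second case the full replacement still produces href="YADA" id="test", which is why swapping * for + appears to change nothing.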

So I took your code, pasted it into an online compiler, and fed it your example <a href="" id="test">:

using System.IO;
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string text = "<a href=\"\" id=\"test\">";
        string pattern = "src|href";
        string absoluteUrl = "YADA";
        string value = Regex.Replace(text, "<(.*?)(" + pattern + ")=\"(?!http|javascript|#)(.*?)\"(.*?)>", "<$1$2=\"" + absoluteUrl + "$3\"$4>", RegexOptions.IgnoreCase | RegexOptions.Multiline);

        Console.WriteLine(value);
    }
}

and guess what, it works:

Compiling the source code....
$mcs main.cs -out:demo.exe 2>&1

Executing the program....
$mono demo.exe 
<a href="YADA" id="test">

So either you are not telling the truth, or you changed the code when posting it here, or something else entirely is messed up in your code, sorry.

EDIT:

So, it turned out that the href="" was meant to be ignored.

Then the simplest thing you can do is to add another negative lookahead that blocks the href="" case explicitly. However, note that the placement of that group will be different. The current group sits inside the href quotes, so it cannot "peek" at what the whole quoted value looks like. The new group must come before the opening quote.

"<(.*?)(" + pattern + ")=(?!\"\")\"(?!http|javascript|#)(.*?)\"(.*?)>"

Note that just before the first quote of the href, I've added a (?!\"\") that ensures that a quote is never immediately followed by another quote.
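As a quick check, here is a minimal sketch that runs this pattern over the cases from the question. The class name HrefLookaheadDemo, the method name Rewrite, and the base URL http://example.com are assumptions made for the example, and it only covers double-quoted attributes as in the original regex.

```csharp
using System;
using System.Text.RegularExpressions;

static class HrefLookaheadDemo
{
    // Same regex as the question, plus the (?!"") lookahead placed before the opening quote.
    public static string Rewrite(string text, string absoluteUrl, string pattern = "src|href")
    {
        string regex = "<(.*?)(" + pattern + ")=(?!\"\")\"(?!http|javascript|#)(.*?)\"(.*?)>";
        string replacement = "<$1$2=\"" + absoluteUrl + "$3\"$4>";
        return Regex.Replace(text, regex, replacement, RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string[] samples =
        {
            "<a href=\"/somepage.htm\" id=\"test\">",                             // rewritten
            "<a href=\"http://www.stackoverflow.com/somepage.htm\" id=\"test\">", // unchanged
            "<a href=\"#\" id=\"test\">",                                         // unchanged
            "<a href=\"javascript:alert('test');\" id=\"test\">",                 // unchanged
            "<a href=\"\" id=\"test\">"                                           // unchanged
        };

        foreach (string sample in samples)
        {
            Console.WriteLine(Rewrite(sample, "http://example.com"));
        }
    }
}
```

Only the first sample should come back changed, as <a href="http://example.com/somepage.htm" id="test">.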

quetzalcoatl

I know that you are asking for RegEx.

But here is an alternative, because I think using Uri.IsWellFormedUriString is worth it. This way you can also reuse the helper functions:

public string RelativeToAbsoluteURLS(string text, string absoluteUrl, string pattern = "src|href")
{
    if (isHrefRelativeURIPath(text)){
        // Strip leading slashes from the value itself, then prefix the absolute URL.
        text = absoluteUrl + "/" + System.Text.RegularExpressions.Regex.Replace(text, @"^\/+", "");
    }

    return text;
}

public bool isHrefRelativeURIPath(string value) {
    if (isLink(value) ||
        value.StartsWith("#") ||
        value.StartsWith("javascript"))
    {
        return false;
    }

    // Others Custom exclusions

    return true;
}


public bool isLink(string value) {
    if (String.IsNullOrEmpty(value))
        return false;

    return Uri.IsWellFormedUriString("http://" + value, UriKind.Absolute);
}
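
These helpers take a bare attribute value, not a whole HTML string. One possible way to apply a check like this to values embedded in markup is Regex.Replace with a MatchEvaluator, sketched below under some assumptions: the names AttributeRewriter and IsRelativePath are hypothetical, the predicate is a simplified stand-in for the helpers above (it does not call Uri.IsWellFormedUriString), and only double-quoted src/href attributes are handled.

```csharp
using System;
using System.Text.RegularExpressions;

static class AttributeRewriter
{
    // Hypothetical predicate: true only for values that should get the prefix.
    static bool IsRelativePath(string value)
    {
        if (string.IsNullOrEmpty(value)) return false;                                   // leave href="" alone
        if (value.StartsWith("#", StringComparison.Ordinal)) return false;               // in-page anchor
        if (value.StartsWith("javascript", StringComparison.OrdinalIgnoreCase)) return false;
        if (value.StartsWith("http", StringComparison.OrdinalIgnoreCase)) return false;  // already absolute
        return true;
    }

    public static string Rewrite(string html, string absoluteUrl)
    {
        // [^"]* cannot run past the closing quote, so group 2 is exactly the attribute value.
        return Regex.Replace(html, "\\b(src|href)=\"([^\"]*)\"", m =>
        {
            string value = m.Groups[2].Value;
            if (!IsRelativePath(value))
            {
                return m.Value; // keep the attribute exactly as it was
            }
            // Join with a single slash regardless of how either side is written.
            return m.Groups[1].Value + "=\"" + absoluteUrl.TrimEnd('/') + "/" + value.TrimStart('/') + "\"";
        }, RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        Console.WriteLine(Rewrite("<a href=\"/somepage.htm\" id=\"test\">", "http://example.com"));
        Console.WriteLine(Rewrite("<a href=\"\" id=\"test\">", "http://example.com"));
    }
}
```

Because the decision is made in ordinary code rather than in lookaheads, further exclusions can be added to the predicate without touching the pattern.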
Andre Figueiredo
  • Thanks @Andre, but this will work only for those cases having only uri not string with html stuff like href="http://someexample.com" :( – vikas Mar 11 '14 at 19:33