2

I have the html source of a page in a form of string with me:

<html>
    <head>
          <link rel="stylesheet" type="text/css" href="/css/all.css" /> 
    </head>
    <body>
        <a href="/test.aspx">Test</a>
        <a href="http://mysite.com">Test</a>
        <img src="/images/test.jpg"/>
        <img src="http://mysite.com/images/test.jpg"/>
    </body>
</html>

I want to convert all the relative paths to absolute. I want the output be:

<html>
    <head>
          <link rel="stylesheet" type="text/css" href="http://mysite.com/css/all.css" /> 
    </head>
    <body>
        <a href="http://mysite.com/test.aspx">Test</a>
        <a href="http://mysite.com">Test</a>
        <img src="http://mysite.com/images/test.jpg"/>
        <img src="http://mysite.com/images/test.jpg"/>
    </body>
</html>

Note: I want only the relative paths to be converted to absolute ones in that string. The absolute ones which are already in that string should not be touched, they are fine to me as they are already absolute. Can this be done by regex or other means?

Charles
  • 50,943
  • 13
  • 104
  • 142
Rocky Singh
  • 15,128
  • 29
  • 99
  • 146

6 Answers6

21

Don't try to parse html with regex as expained here https://stackoverflow.com/a/1732454/932418 and https://stackoverflow.com/a/1758162/932418

Use an html parser like HtmlAgilityPack instead

string html = 
@"<html>
    <head>
            <link rel=""stylesheet"" type=""text/css"" href=""/css/all.css"" /> 
    </head>
    <body>
        <a href=""/test.aspx"">Test</a>
        <a href=""http://example.com"">Test</a>
        <img src=""/images/test.jpg""/>
        <img src=""http://example.com/images/test.jpg""/>
    </body>
</html>";

StringWriter writer = new StringWriter();
string baseUrl= "http://example.com";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

foreach(var img in doc.DocumentNode.Descendants("img"))
{
    img.Attributes["src"].Value = new Uri(new Uri(baseUrl), img.Attributes["src"].Value).AbsoluteUri;
}

foreach (var a in doc.DocumentNode.Descendants("a"))
{
    a.Attributes["href"].Value = new Uri(new Uri(baseUrl), a.Attributes["href"].Value).AbsoluteUri;
}

doc.Save(writer);

string newHtml = writer.ToString();
carla
  • 1,970
  • 1
  • 31
  • 44
L.B
  • 114,136
  • 19
  • 178
  • 224
5

Add

<base href="http://mysite.com/images/" />

To the head of the page

mplungjan
  • 169,008
  • 28
  • 173
  • 236
  • I have html source in a string with me – Rocky Singh Sep 03 '12 at 20:30
  • I have the string. Where should I insert tag? – Rocky Singh Sep 03 '12 at 20:34
  • After ``. While not answering the specific question, this cleanly, quickly and simply results in an HTML document where all relative URIs are treated as the absolute URIs desired, whatever the context it is copied or otherwise moved into. A definite +1 I'd add that you're using XHTML then you should also replace `` with `` so it has both the HTML way and the XML way of setting a base URI in the document. – Jon Hanna Sep 03 '12 at 22:22
  • Neat! Worked for me. – Sunil Kumar Oct 13 '18 at 13:02
0

Check this out, it could help you.

It is in the following format: http(s)://domain(:port)/AppPath)

HttpContext.Current.Request.Url.Scheme + "://" + HttpContext.Current.Request.Url.Authority + HttpContext.Current.Request.ApplicationPath;

Or you could use:

Page.ResolveUrl("img/youFile");
Juha Syrjälä
  • 33,425
  • 31
  • 131
  • 183
0

Use regular expressions for this. Here is short example

static void Main(string[] args)
    {
        string input = "<html>\n<head>\n<link rel=\"stylesheet\" type=\"text/css\" href=\"/css/all.css\" /> \n</head>\n<body>\n<a href=\"/test.aspx\">Test</a>\n<a href=\"http://mysite.com\">Test</a>\n<img src=\"/images/test.jpg\"/>\n<img src=\"http://mysite.com/images/test.jpg\"/>\n</body>\n</html>";
        string pattern = "((?:src|href)[\\s]*?)(?:\\=[\\s]*?[\\\"\\\'])[\\/*\\\\*]?(?!..+[s]?\\:[\\/]*)(.*?)(?:[\\s\\\"\\\'])";
        var reg = new Regex(pattern, RegexOptions.IgnoreCase);
        string prefix = @"http://mysite.com";
        var result = reg.Replace(input, "$1=\""+prefix+"$2\"");
    }

the result is

<html>
<head>
<link rel="stylesheet" type="text/css" href="http://mysite.com/css/all.css" /> 
</head>
<body>
<a href="http://mysite.com/test.aspx">Test</a>
<a href="http://mysite.com">Test</a>
<img src="http://mysite.com/images/test.jpg"/>
<img src="http://mysite.com/images/test.jpg"/>
</body>
</html>
0

Look at this function:

Private Function ConvertALLrelativeLinksToAbsoluteUri(ByVal html As String, ByVal PageURL As String)
    Dim result As String = Nothing
    ' Getting all Href
    Dim opt As New RegexOptions
    Dim XpHref As New Regex("(href="".*?"")", RegexOptions.IgnoreCase)
    Dim i As Integer
    Dim NewSTR As String = html
    For i = 0 To XpHref.Matches(html).Count - 1
        Application.DoEvents()
        Dim Oldurl As String = Nothing
        Dim OldHREF As String = Nothing
        Dim MainURL As New Uri(PageURL)
        OldHREF = XpHref.Matches(html).Item(i).Value
        Oldurl = OldHREF.Replace("href=", "").Replace("HREF=", "").Replace("""", "")
        Dim NEWURL As New Uri(MainURL, Oldurl)
        Dim NewHREF As String = "href=""" & NEWURL.AbsoluteUri & """"
        NewSTR = NewSTR.Replace(OldHREF, NewHREF)
    Next
    html = NewSTR
    Dim XpSRC As New Regex("(src="".*?"")", RegexOptions.IgnoreCase)
    For i = 0 To XpSRC.Matches(html).Count - 1
        Application.DoEvents()
        Dim Oldurl As String = Nothing
        Dim OldHREF As String = Nothing
        Dim MainURL As New Uri(PageURL)
        OldHREF = XpSRC.Matches(html).Item(i).Value
        Oldurl = OldHREF.Replace("src=", "").Replace("src=", "").Replace("""", "")
        Dim NEWURL As New Uri(MainURL, Oldurl)
        Dim NewHREF As String = "src=""" & NEWURL.AbsoluteUri & """"
        NewSTR = NewSTR.Replace(OldHREF, NewHREF)
    Next
    Return NewSTR
End Function
Brad Larson
  • 170,088
  • 45
  • 397
  • 571
Mahmoud
  • 1
  • 1
0

This works great for me. I uses it on email templates. I'm using the MVC/Razor "~/" at the beginning of each link.

' Parse HTML and make relative links absolute with p_basepath
Public Function ParseHTMLLinks(ByVal MailBodyHTML As String) As String
    ' Declare & intialize variables
    Dim strHTMLBody As String = MailBodyHTML

    ' Set regex variables 
    Dim strSrcSubMatch As String = ""
    Dim strSrcFullUrl As String = ""
    Dim srcPattern As String = "[=""]\/?([^""\s]*(\.gif|\.jpg|\.jpeg|\.png|\.css|\.js))[""\s]"
    Dim srcOptions As RegexOptions = RegexOptions.IgnoreCase
    Dim regex As Regex = New Regex(srcPattern, srcOptions)
    Dim regexSub As Regex = New Regex(srcPattern, srcOptions)
    Dim Matches As MatchCollection = regex.Matches(strHTMLBody)

    Try
        For Each Match As Match In Matches
            ' filter out absolute links
            If InStr(Match.ToString, "://") = 0 And InStr(LCase(Match.ToString), "mailto:") = 0 And InStr(LCase(Match.ToString), "javascript:") = 0 Then
                ' Remove the " at each end of relative path
                strSrcSubMatch = regexSub.Replace(Match.ToString, "$1")
                ' Concatenate the FullPath
                strSrcFullUrl = p_basePath & strSrcSubMatch
                ' Execute the replace
                strHTMLBody = Replace(strHTMLBody, "/" & strSrcSubMatch, strSrcFullUrl)
            End If
        Next

    Catch e As WebException
        'Add errors to List(Of WebException), if any.
        ErrorCodes.Add(e)
    End Try

    Return strHTMLBody 'MailBodyHTML
End Function