1

How would I parse the following:

wr("website-url.com</span>")

with regex from HTML code?

I can't seem to figure out how to extract website-url.com.

The whole JavaScript that lies within the HTML:

<script type="text/javascript">wr("<span>maddog");wr("@");wr("website-url.com</span>")</script>

Tried regex like:

wr("(.+?)\s*<\/span>")

but I can't seem to get it to work.

Alosyius

4 Answers

0
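// Requires: using System.Linq; (for Last())
string a = @"<script type=""text/javascript"">wr(""<span>maddog"");wr(""@"");wr(""website-url.com</span>"")</script>";
// Strip the script tags, then split the remaining statements on ';'
string[] b = a.Replace(@"<script type=""text/javascript"">", "").Replace("</script>", "").Split(';');
string c = b.Last();
// Remove the wr(" prefix and the </span>") suffix, including the double quotes
string d = c.Replace("wr(\"", "").Replace("</span>\")", "");

d is the final result (website-url.com). The replacements also strip the double quotes around the argument; adjust them if the quoting in the page differs.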

  • If the HTML code never changes this may work, but even an extra can make this solution fail. – L.B Oct 23 '12 at 21:50
0

It seems that the site you got this JavaScript from doesn't want you to parse its HTML: it builds the HTML dynamically with a JavaScript function, wr. Below is code that executes this JavaScript and parses the resulting HTML. However, I can't say that this code is simple to follow.

public void Test()
{
    //C# object which will be accessed by javascript
    var csharpObj = new MyCSharpObject();

    //Create Javascript object
    Type scriptType = Type.GetTypeFromCLSID(Guid.Parse("0E59F1D5-1FBE-11D0-8FF2-00A0D10038BC"));
    dynamic obj = Activator.CreateInstance(scriptType, false);
    obj.Language = "Javascript";
    obj.AddObject("csharp", csharpObj);

    //Load Html (your string in question)
    string html = @"<script type=""text/javascript"">wr(""<span>maddog"");wr(""@"");wr(""website-url.com</span>"")</script>";
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    //Create "wr" function
    string script = "function wr(s){csharp.wr(s);}";

    //Get the text of script tag                
    script += doc.DocumentNode.SelectSingleNode("//script").InnerText;

    //Execute script
    obj.Eval(script);

    //Load the string created by javascript execution
    doc.LoadHtml(csharpObj.Output);

    //tada.....
    var eMailAddress = doc.DocumentNode.InnerText;

    Console.WriteLine(eMailAddress);
}

[ComVisible(true)]
public class MyCSharpObject
{
    public string Output = "";
    public void wr(string s)
    {
        Output += s;
    }
}

--------EDIT--------

"I'm not sure how to write the "Get all the wr(*) strings""

Although it seems you want a solution like this, I wouldn't depend on regex to parse HTML.

public void Test2()
{
    string html = @"<script type=""text/javascript"">wr(""<span>maddog"");wr(""@"");wr(""website-url.com</span>"")</script>";

    var parsedHtml = String.Join("",Regex.Matches(html, @"wr\(\""(.+?)\""\)")
                                            .Cast<Match>()
                                            .Select(m => m.Groups[1].Value));

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(parsedHtml);
    var eMailAddress = doc.DocumentNode.InnerText;
}
L.B
0

The idea is:

  • Get all the wr(*) strings with one regex.
  • Remove quote marks (")
  • Remove <span> and </span>

Here is a solution in Python.

import re

def geturl(text):
    '''
    Get all the wr(*) strings.
    Remove quotes.
    Remove <span> and </span>
    '''
    regex = re.compile(r'wr\(([^)]*)\)')
    match = regex.findall(text)
    url = ''.join([s.replace('"', '') for s in match])
    url = url.replace('<span>', '').replace('</span>', '')
    return url

if __name__ == '__main__':
    xx = '''<script type="text/javascript">wr("<span>maddog");wr("@");wr("website-url.com</span>")</script>'''
    url = geturl(xx)
    print(url)

Gives maddog@website-url.com
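
A possible refinement of the same idea, as a minimal sketch rather than a definitive implementation: capture only the double-quoted argument of each wr(...) call, so that a ")" inside the string does not cut the match short, and strip any tag instead of hard-coding <span>. The names WR_CALL, TAG and extract_wr_text are made up for this illustration, and the pattern assumes every argument is a plain double-quoted literal with no escaped quotes inside.

```python
import re

# Hypothetical pattern: one wr("...") call with a plain double-quoted argument.
WR_CALL = re.compile(r'wr\(\s*"([^"]*)"\s*\)')
# Naive tag stripper; only meant for tiny fragments such as <span>...</span>.
TAG = re.compile(r'<[^>]+>')


def extract_wr_text(html):
    """Join the arguments of all wr("...") calls and drop any tags."""
    fragment = ''.join(WR_CALL.findall(html))
    return TAG.sub('', fragment)


if __name__ == '__main__':
    page = ('<script type="text/javascript">'
            'wr("<span>maddog");wr("@");wr("website-url.com</span>")'
            '</script>')
    print(extract_wr_text(page))
```

This still treats the page as text, so it only holds up while the wr calls keep this exact shape.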

user650654
  • I think there are two problems with this answer **a)** trying to parse html with regex (see [this link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)) **b)** OP wants a c# solution – L.B Oct 23 '12 at 21:45
  • How would I write that in C#? :) I'm not sure how to write the "Get all the wr(*) strings." – Alosyius Oct 23 '12 at 21:49

-1

If you're using regular expressions to parse HTML, you are probably doing something the hard way that you could be doing the easy way. In C#, try using the HTML Agility Pack. See also the definitive question on the matter.
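
To illustrate the parser-first approach without the HTML Agility Pack, here is a rough sketch that swaps in Python's standard html.parser module. It lets the parser locate the script element, pulls the string arguments out of the wr calls, and then parses the assembled fragment to get its text. It does not execute any JavaScript. The class and function names (ScriptCollector, TextCollector, email_from_page) are hypothetical, and the small regex assumes each wr argument is a simple double-quoted literal.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the raw text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks; keep them all.
        if self.in_script:
            self.scripts.append(data)


class TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def email_from_page(html):
    # Step 1: let the parser find the script body.
    scripts = ScriptCollector()
    scripts.feed(html)
    scripts.close()
    js = ''.join(scripts.scripts)

    # Step 2: rebuild the HTML that the wr("...") calls would have written.
    fragment = ''.join(re.findall(r'wr\(\s*"([^"]*)"\s*\)', js))

    # Step 3: parse that fragment and keep only its text.
    text = TextCollector()
    text.feed(fragment)
    text.close()
    return ''.join(text.parts)


if __name__ == '__main__':
    page = ('<html><body><script type="text/javascript">'
            'wr("<span>maddog");wr("@");wr("website-url.com</span>")'
            '</script></body></html>')
    print(email_from_page(page))
```

The regex here only runs on the script text the parser has already isolated, so changes to the surrounding markup should not affect it.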

Thom Smith