1

I'm currently trying to parse a webpage to get a certain string:

<script type="text/javascript" src="./interceptor/resource/org.apache.wicket.resource.JQueryResourceReference/jquery/jquery-3.4.1-ver-220AFD743D9E9643852E31A135A9F3AE.js?requestSecurityToken=610f15bd-0e23-4ac5-90c3-c0829ad8024e"></script>

This is the code I came up with to load the web page:

using (HttpClient http = new HttpClient())
{               
    var response = await http.GetStringAsync(pagelink);
    Console.WriteLine(response);
    HtmlDocument pageDocument = new HtmlDocument();
    pageDocument.LoadHtml(response);

    var token = pageDocument.DocumentNode.SelectSingleNode("").InnerText;
    Console.WriteLine(token);
}

The issue is that I only need the token from the string shown above: 610f15bd-0e23-4ac5-90c3-c0829ad8024e

I guess there should be a method for this, but I can't get it to work even with XPath. So I was wondering whether there is a way to extract it as the text between two delimiter strings, for example:

left string: requestSecurityToken= right string: ></script>
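
The left/right delimiter idea can be sketched with plain string operations. This is only a minimal sketch, assuming the page source is already in a string; `TokenExtractor` and `Between` are hypothetical names, not part of any library.

```csharp
using System;

static class TokenExtractor
{
    // Hypothetical helper: returns the text between the first occurrence of
    // `left` and the next occurrence of `right`, or null if either is missing.
    public static string Between(string source, string left, string right)
    {
        int start = source.IndexOf(left, StringComparison.Ordinal);
        if (start < 0) return null;
        start += left.Length; // skip past the left delimiter itself

        int end = source.IndexOf(right, start, StringComparison.Ordinal);
        if (end < 0) return null;

        return source.Substring(start, end - start);
    }
}
```

For the script tag above, `TokenExtractor.Between(response, "requestSecurityToken=", "\"")` would be one way to call it. The closing quote of the `src` attribute is used as the right delimiter here, since it comes before `></script>`.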

Filburt
  • Seems, essentially, to be a duplicate of https://stackoverflow.com/questions/11040707/c-sharp-regex-for-guid – Caius Jard Dec 27 '20 at 10:16
  • I'd break this up into two parts: Extract the attribute value of `src` and treat it as a `Uri` (which it is). IMHO that's way easier than mucking with regex. – Filburt Dec 27 '20 at 10:18
  • Thanks @CaiusJard, but I don't understand the regex approach used there at all. – Villette Grandpe Dec 27 '20 at 10:19
  • Like this? `var token = pageDocument.DocumentNode.SelectSingleNode("/html/head/script[1]").GetDataAttribute("src");` @Filburt – Villette Grandpe Dec 27 '20 at 10:23

3 Answers

2

"way easier than mucking with regex"

I didn't think it was so hard..

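// using System.Text.RegularExpressions;
// html is the page source (the `response` string in the question)
var regex = @"\b[a-f0-9]{8}(?:-[a-f0-9]{4}){3}-[a-f0-9]{12}\b";
var m = Regex.Match(html, regex);
Console.WriteLine(m.Value);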

If you only want to pull out a GUID that follows `requestSecurityToken=`, you could use a capturing group:

var regex = @"requestSecurityToken=([a-f0-9]{8}(?:-[a-f0-9]{4}){3}-[a-f0-9]{12})";
var m = Regex.Match(html, regex);
Console.WriteLine(m.Groups[1].Value);
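
When the page contains no token, the match fails and `m.Groups[1].Value` is an empty string, so it can help to check `m.Success` first. A minimal sketch of that check, using the same pattern as above; `TokenRegex` and `TryGetToken` are hypothetical names, and the `IgnoreCase` option is an added assumption in case the token is ever served in upper case:

```csharp
using System;
using System.Text.RegularExpressions;

static class TokenRegex
{
    // Same pattern as above; group 1 captures the GUID after the parameter name.
    // RegexOptions.IgnoreCase is an assumption, not something the page requires.
    private static readonly Regex Pattern = new Regex(
        @"requestSecurityToken=([a-f0-9]{8}(?:-[a-f0-9]{4}){3}-[a-f0-9]{12})",
        RegexOptions.IgnoreCase);

    // Returns true and sets token when a match is found; otherwise token is null.
    public static bool TryGetToken(string html, out string token)
    {
        Match m = Pattern.Match(html);
        token = m.Success ? m.Groups[1].Value : null;
        return m.Success;
    }
}
```

A caller could then write `if (TokenRegex.TryGetToken(html, out var token)) Console.WriteLine(token);` and handle the missing-token case separately.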
Caius Jard
  • That will find *any* GUID, granted, but if you need to find a specific instance **and** can simply treat a Uri for what it is, why not do so? – Filburt Dec 27 '20 at 10:37
  • You can treat the URI for what it is, but you still have to pull it out of the entire html, and if you're going to do that (with a regex? :) ) you might as well just pull out the thing you actually want – Caius Jard Dec 27 '20 at 12:32
1

Try something like this:

string html = @"<script type=""text/javascript"" src=""./interceptor/resource/org.apache.wicket.resource.JQueryResourceReference/jquery/jquery-3.4.1-ver-220AFD743D9E9643852E31A135A9F3AE.js?requestSecurityToken=610f15bd-0e23-4ac5-90c3-c0829ad8024e""></script>";

// use something to extract value of the src attribute
// I'll use XDocument, but it is not good for HTML documents
XDocument xdoc = XDocument.Parse( html );
string src = xdoc.Root.Attribute("src")?.Value;

if (src is null) throw new Exception();

string[] splitted = src.Split('?'); // the char overload also compiles on .NET Framework
string queryString = splitted[1]; //"requestSecurityToken=610f15bd-0e23-4ac5-90c3-c0829ad8024e"

// using System.Collections.Specialized; using System.Web;
NameValueCollection parsed = HttpUtility.ParseQueryString( queryString );

Console.WriteLine(parsed["requestSecurityToken"]);
apocalypse
0

My take without regex or string splitting:

// using System.Xml.Linq; using System.Xml.XPath;
// as already noted, XElement or XDocument may not be the best choice for handling HTML
var xe = XElement.Parse(response);

// XPath will make sure you are looking at the right script element
var src = xe.XPathSelectElement("//script[contains(@src, 'requestSecurityToken')]").Attribute("src").Value;

// since a relative Uri doesn't support reading its Query, you need to supply a placeholder base Uri
Uri srcuri = new Uri(new Uri("http://localhost"), src);

// finally get the value by name
string token = System.Web.HttpUtility.ParseQueryString(srcuri.Query).Get("requestSecurityToken");
Filburt