0

In C#, once I have read a webpage's contents into a string, what is the best way to search it for extensions? I just want to extract the URLs within the page that end in .html, .xhtml, or .edu. I don't care what the beginning of the URL looks like. Which is better for finding these: EndsWith or Regex?

So if my input looked like this:

string str = {var a,b=window.location.href.match(//webhp\?[^#]tune=[^#]/);if(a=b&&b.length>0?"http://www.google.com/logos/2011/lespaul.html"+b[

I want to pull out http://www.google.com/logos/2011/lespaul.html and store it in an array.

user990951

3 Answers

3

You should use an HTML parser such as sharp-query or HTML Agility Pack, and never use regular expressions for parsing HTML; otherwise, as the author of this post says, some things might happen.

Darin Dimitrov
  • If you are just matching/extracting URLs regular expressions should be fine. I believe the point is parsing HTML is beyond regEx. – RBZ Oct 13 '11 at 20:59
  • No, even for url parsing you should avoid regular expressions. You should use Url parsers. They are even built into the .NET framework. – Darin Dimitrov Oct 13 '11 at 21:00
  • What makes an expression regular? – RBZ Oct 13 '11 at 21:52
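
To make the "URL parsers built into .NET" suggestion concrete, here is a minimal sketch that validates candidate strings with `System.Uri` and then checks the ending with `EndsWith`. The helper name `FilterByExtension` and the `example.*` URLs are hypothetical, and the sketch assumes the candidates have already been split out of the page text (for instance on quotes and whitespace). It treats `.html`/`.xhtml` as path endings and `.edu` as a host ending, which is one reading of the question, not the only one. It uses top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the thread): keeps only the candidates that
// parse as absolute http/https URLs and end in one of the wanted extensions.
static string[] FilterByExtension(IEnumerable<string> candidates)
{
    string[] pathSuffixes = { ".html", ".xhtml" };
    var hits = new List<string>();

    foreach (var candidate in candidates)
    {
        // Uri.TryCreate rejects strings that are not well-formed absolute URIs.
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out var uri)) continue;

        // Skip non-web schemes such as file: or mailto:.
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) continue;

        // AbsolutePath excludes the query string, so "page.html?x=1" still counts.
        bool pathMatches = Array.Exists(pathSuffixes,
            s => uri.AbsolutePath.EndsWith(s, StringComparison.OrdinalIgnoreCase));

        // ".edu" is a top-level domain, so it is checked on the host instead.
        bool hostIsEdu = uri.Host.EndsWith(".edu", StringComparison.OrdinalIgnoreCase);

        if (pathMatches || hostIsEdu) hits.Add(candidate);
    }
    return hits.ToArray();
}

// Usage with illustrative candidates.
var found = FilterByExtension(new[]
{
    "http://www.google.com/logos/2011/lespaul.html",
    "http://example.edu/",
    "http://example.com/a.php",
    "not a url"
});
foreach (var url in found) Console.WriteLine(url);
```

The trade-off is that `Uri` only validates strings you hand it; it does not find URLs inside a larger blob of text, so some splitting or matching step is still needed first.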
1

I could come up with this regex: `http:\/\/(.*?)(.html|.xhtml|.edu)`
Edit: Thanks to @Kakashi: `http:\/\/.*?\.(?:x?html|edu)`

Srinivas
  • You're creating unnecessary groups in your regex. `http:\/\/.*?\.(?:x?html|edu)` – Kakashi Oct 13 '11 at 21:06
  • Okay, I got it working with this. Now here is another question for you: for something like .php?wsdl, how would you get that into the regex? I thought it was as simple as `http:\/\/(.*?)(.html|.xhtml|.edu|.php\?wsdl)` – user990951 Oct 13 '11 at 21:13
  • @user990951 I didn't get the question, perhaps you could explain it better. I would be glad to help with that as well. – Srinivas Oct 13 '11 at 21:15
  • I am trying to find other extensions as well. The extension I am looking for is .php?wsdl (which is an extension for WSDL). – user990951 Oct 13 '11 at 21:17
  • You can add any extension such as `.php?wsdl` or `aspx` to the same regex, as long as the literal `?` is escaped: `http:\/\/.*?\.(?:x?html|edu|php\?wsdl|aspx)` – Srinivas Oct 13 '11 at 21:23
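
Since the question asks for the results in an array, here is a hedged sketch that collects every match with `Regex.Matches` instead of stopping at the first one. The helper name `ExtractUrls` and the `example.com` URL are hypothetical, and the pattern is a variation on the ones in this thread rather than a general-purpose URL matcher. A literal `?` has to be written as `\?`; left unescaped, `php?wsdl` would mean "ph, an optional p, then wsdl". It uses top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the thread): returns every matching URL as an array.
static string[] ExtractUrls(string text)
{
    // [^\s"'<>]+ stops at whitespace, quotes and angle brackets, so one match
    //            cannot run across two separate URLs;
    // \?         matches a literal question mark;
    // \b         keeps ".edu" from matching inside a longer word like ".education".
    var pattern = @"https?://[^\s""'<>]+\.(?:x?html|edu|php\?wsdl|aspx)\b";

    return Regex.Matches(text, pattern, RegexOptions.IgnoreCase)
                .Cast<Match>()
                .Select(m => m.Value)
                .ToArray();
}

// Usage: a script-like fragment with two quoted URLs (the second is illustrative).
var sample = "a=b?\"http://www.google.com/logos/2011/lespaul.html\"+b; c='http://example.com/service.php?wsdl'";
foreach (var url in ExtractUrls(sample)) Console.WriteLine(url);
// http://www.google.com/logos/2011/lespaul.html
// http://example.com/service.php?wsdl
```

This still only recognizes URLs by their ending, so anything after the extension (a longer query string or a fragment) is cut off from the match.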
0

Try this:

using System;
using System.Text.RegularExpressions;
var input = "string str = {var a,b=window.location.href.match(//webhp\\?[^#]tune=[^#]/);if(a=b&&b.length>0?\"http://www.google.com/logos/2011/lespaul.html";
var match = Regex.Match(input, @"https?:\/{2}[^\n]+\.(?:x?html|edu)"); // Regex.Match returns only the first match
Console.Write(match.Success ? match.Groups[0].Value : "Not found"); // http://www.google.com/logos/2011/lespaul.html
Kakashi