In C#, I am trying to extract both the URLs and the inner text of links from a text file. I don't have access to a DOM object on the device I am using (only the text file), so regular expressions are my only option.

<a href="/LinkClick.aspx?fileticket=a random text string">I want this text</a>

I need to extract every such pair from the text file:

URL = /LinkClick.aspx?fileticket=a random text string
TITLE = I want this text
Ian Vink
  • 2
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Joe Dec 24 '11 at 20:22
  • 2
    Trying to parse the data with RegEx is not really a good idea. If you've got the HTML text file, you can access the DOM. For example, "using System.Windows.WebBrowser ... HtmlDocument hdoc = HtmlPage.Document;" – paulsm4 Dec 24 '11 at 20:24
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1758162#1758162 – L.B Dec 24 '11 at 20:34

2 Answers

RegEx to parse HTML? It's theoretically possible, but I've not had great success with it unless you can be sure that you start with nice, clean XHTML. The problem is that legitimate HTML is not always well formed, and markup can span lines and still be valid HTML yet slip past the RegEx. I would recommend finding a library that parses the HTML into a DOM tree for you, then using XPath to work your way through the resulting DOM. C# has an HtmlDocument class, no? I'd try that before I resorted to RegEx.

Bob Kuhar
  • As I mentioned, I have limited access and can't parse HTML any other way. I'm running it on Linux on an underpowered device. – Ian Vink Dec 24 '11 at 21:00

You could use a regular expression like this one:

\<a.+?href=(?<q>["'])(.+?)\k<q>.*?>([^\<]+)

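In .NET, unnamed groups are numbered before named ones, so URL will be the value of group 1 and TITLE will be the value of group 2; the named group q is numbered last, as group 3.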
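As a rough sketch of how this could be wired up in C#, the variant below names the URL and title groups so the code does not depend on group numbering. The class name LinkExtractor, the method Extract, and the group names url and title are made up for illustration; the pattern is also tightened slightly (it requires whitespace after `<a` and stops at `>`), which is my assumption rather than part of the pattern above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Hypothetical variant of the pattern above, using named groups.
    // Singleline lets '.' match newlines, so an href value may span lines.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\s[^>]*?href=(?<q>[""'])(?<url>.+?)\k<q>[^>]*>(?<title>[^<]+)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one (URL, TITLE) pair per anchor found in the text.
    public static List<KeyValuePair<string, string>> Extract(string text)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in AnchorPattern.Matches(text))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups["url"].Value,
                m.Groups["title"].Value));
        }
        return links;
    }
}
```

Calling LinkExtractor.Extract(File.ReadAllText(path)) on the file contents would then yield the pairs. Note that the title group stops at the first `<`, so link text containing nested tags such as `<b>` would be cut short.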

If your document is valid XHTML, you can also use the classes in the System.Xml namespace to parse your document, then retrieve all <a> elements.
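A minimal sketch of that System.Xml route, assuming the input is well-formed XML with a single root element and no DTD or HTML-only entities such as `&nbsp;` (otherwise LoadXml will throw an XmlException). The class name XhtmlLinkExtractor and the method Extract are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XhtmlLinkExtractor
{
    // Returns one (URL, TITLE) pair per <a> element that has an href attribute.
    public static List<KeyValuePair<string, string>> Extract(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the text is not well-formed

        var links = new List<KeyValuePair<string, string>>();
        // GetElementsByTagName matches on the qualified name, so it also
        // finds <a> elements that sit in a default XHTML namespace.
        foreach (XmlElement a in doc.GetElementsByTagName("a"))
        {
            if (a.HasAttribute("href"))
            {
                links.Add(new KeyValuePair<string, string>(
                    a.GetAttribute("href"),
                    a.InnerText));
            }
        }
        return links;
    }
}
```

Unlike the regular expression, InnerText here concatenates the text of any nested elements inside the link.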

Ry-