I have the following two examples of HTML:

<a href="http://foo.com">User</a>: <a style="color:#333" href="http://foo.com/word"></a> blue elephant  &middot;

<a href="http://foo.com">User</a>: <a style="color:#333" href="http://foo.com/word">@<b>word</b></a> blue elephant  &middot;

I am trying to parse this in C# and write the result to a CSV file. It works to an extent; however, when the HTML contains the '@' symbol, the CSV cell is either left blank or is missing the word that follows the '@'. The main part I am trying to get is @word blue elephant, but this returns a blank cell, whereas the first HTML example returns blue elephant as desired.

I am using the following technique to do this:

string[] comm = System.Text.RegularExpressions.Regex.Split(content[1], "<a");

How can I alter this to work for the second html example?

Ebikeneser

1 Answer

In this situation you want to use a proper HTML parser, such as the one in the HTML Agility Pack, rather than regular expressions (and save yourself from invoking the wrath of Cthulhu).

Some examples of how to use it
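For the two snippets in the question, a minimal sketch might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced and that the comment text is everything after the first `<a>` (the user link). The class and method names `CommentExtractor` and `ExtractComment` are hypothetical, chosen only for illustration.

```csharp
using System.Text;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class CommentExtractor
{
    // Hypothetical helper: returns the text that follows the first <a> element,
    // which in the question's snippets is the user link.
    public static string ExtractComment(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // First anchor in the fragment; null if the fragment has no <a> at all.
        HtmlNode userLink = doc.DocumentNode.SelectSingleNode("//a");
        if (userLink == null)
        {
            return string.Empty;
        }

        // Concatenate the text of every sibling after the user link.
        // InnerText drops the tags, so "@<b>word</b>" contributes "@word".
        var text = new StringBuilder();
        for (HtmlNode node = userLink.NextSibling; node != null; node = node.NextSibling)
        {
            text.Append(node.InnerText);
        }

        // Decode entities such as &middot;, then trim the leading ": "
        // and the trailing middle dot (U+00B7) and spaces.
        string decoded = HtmlEntity.DeEntitize(text.ToString());
        return decoded.Trim(':', ' ', '\u00B7');
    }
}
```

Calling `CommentExtractor.ExtractComment(...)` on the second snippet should yield `@word blue elephant`, and on the first snippet `blue elephant`. Because the parser walks the node tree instead of splitting on `<a`, nested tags inside the link no longer affect the result.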

carla
Russ Cam
  • Ok thanks for the input, I presume my question would not be overly complex when using a tool like this? – Ebikeneser Oct 24 '11 at 22:05
  • No, it's pretty easy to use and understand if you're familiar with the structure of HTML documents. If you're not, you soon will be :) – Russ Cam Oct 24 '11 at 22:32
  • I have marked your answer as useful; however, I will give full credit once I get my head around the Agility Pack. Thank you. – Ebikeneser Oct 24 '11 at 22:34