C# Text Matching HTML

Question

I'm trying to interact with a really crappy "web-service" (cleverly disguised as a simple aspx page...). I don't control the page, so I can't tweak the output and I'm stuck with it. The format is always the same, like this:

<b>
   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>
</b>
<br />123 North Main
<br />Hume, ACT
<br />(999) 888-8888

So, I need to parse out the URL, Name, Address, City, State, and Phone. It's not really properly formed XML, so I can't use an XML parser, and RegEx seems painfully nasty. Am I stuck with plain string methods such as IndexOf and Substring?

Thanks for your suggestions... James

score 2 · Answer 1 · edited May 23 '17 at 10:32

You can use an HTML parser to parse the page; Html Agility Pack is a free and robust one. Alternatively, you can use any XQuery processor for .NET. Please have a look at this thread to see the drawbacks of using regex to parse HTML pages.

answered Feb 21 '13 at 17:11

Sleiman Jneidi

Brad M · Answer 2 · 2013-02-21T17:18:56.487

There is no need for a regex, assuming the HTML elements remain static. My solution would be to find the indexes of the <b>, </b>, and <br /> elements, then take substrings from one index to the next. For example:

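// Locate the <b>...</b> block that wraps the link.
int bStartIndex = html.IndexOf("<b>");
int bEndIndex = html.IndexOf("</b>");
// Length of the content between the tags ("<b>" is 3 characters long).
int linkSize = bEndIndex - bStartIndex - 3;
// This is the whole <a ...>...</a> element; apply the same technique to it to isolate the href and the name.
string link = html.Substring(bStartIndex + 3, linkSize).Trim();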

And yes, this method is a crude hack. However, given the circumstances of a "really crappy web-service", I think it's a fair and straightforward solution, albeit tedious.
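To show how the same index-and-substring idea might extend to every field, here is a minimal sketch. It assumes the response contains plain double quotes (the backslashes in the question being C# string escaping) and that the fields always arrive in the order shown. The `Between` helper is a hypothetical name introduced for illustration, not a framework method.

```csharp
using System;

// Sample response in the format shown in the question.
string html =
    "<b>\n" +
    "   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>\n" +
    "</b>\n" +
    "<br />123 North Main\n" +
    "<br />Hume, ACT\n" +
    "<br />(999) 888-8888";

// pos tracks how far into the text we have read.
int pos = 0;

// The URL sits between href=" and the next quote; the name between the tag's > and </a>.
string url = Between(html, "href=\"", "\"", ref pos);
string name = Between(html, ">", "</a>", ref pos);

// Each remaining field starts after a <br /> and runs to the next <br /> (or the end).
string address = Between(html, "<br />", "<br />", ref pos);
string[] cityState = Between(html, "<br />", "<br />", ref pos).Split(',');
string city = cityState[0].Trim();
string state = cityState[1].Trim();
string phone = Between(html, "<br />", "", ref pos);

Console.WriteLine($"{name} | {url} | {address} | {city} | {state} | {phone}");

// Hypothetical helper: returns the trimmed text between the next occurrence of
// 'start' (searching from pos) and the following 'end'. An empty 'end' means
// "read to the end of the text". pos is left at the end marker, unconsumed,
// so the same marker can open the next field.
static string Between(string text, string start, string end, ref int pos)
{
    int from = text.IndexOf(start, pos, StringComparison.Ordinal);
    if (from < 0) throw new FormatException($"Marker '{start}' not found.");
    from += start.Length;

    int to = end.Length == 0 ? text.Length : text.IndexOf(end, from, StringComparison.Ordinal);
    if (to < 0) throw new FormatException($"Marker '{end}' not found.");

    pos = to;
    return text.Substring(from, to - from).Trim();
}
```

If the service ever changes the tag spelling (for example `<br/>` without the space), the markers passed to `Between` would need to change too, which is the tedious part mentioned above.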

score 0 · Answer 3 · answered Feb 21 '13 at 17:13

Well, in the past I tried many ways to use framework methods to extract values like these, but that format is too customized. I think the only way is to loop over every line in the response: any time a line contains the link, it will have the URL and the name. Any time a line starts with plain text after the tag, it will be the address, then the city and state, and so on. If for any reason the properties arrive in a different order, the code will fail. I recommend that you (if possible) at least have the service return JSON, which is easy to deserialize. Otherwise you will have to build your own deserializer to get the data you need.
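A rough sketch of that line-by-line loop might look like the following. It assumes one element per line exactly as in the question, plain double quotes in the markup, and the fixed order address, city/state, phone. All variable names here are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Sample response in the format shown in the question.
string response =
    "<b>\n" +
    "   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>\n" +
    "</b>\n" +
    "<br />123 North Main\n" +
    "<br />Hume, ACT\n" +
    "<br />(999) 888-8888";

string url = "", name = "";
var fields = new List<string>(); // text lines in arrival order

foreach (string rawLine in response.Split('\n'))
{
    string line = rawLine.Trim(); // also drops a trailing '\r' if present

    if (line.StartsWith("<a ", StringComparison.OrdinalIgnoreCase))
    {
        // The link line carries both the URL (href value) and the name (link text).
        // This assumes href="..." is present; a missing attribute is not handled.
        int hrefStart = line.IndexOf("href=\"", StringComparison.Ordinal) + "href=\"".Length;
        int hrefEnd = line.IndexOf('"', hrefStart);
        url = line.Substring(hrefStart, hrefEnd - hrefStart);

        int textStart = line.IndexOf('>', hrefEnd) + 1;
        int textEnd = line.IndexOf("</a>", textStart, StringComparison.Ordinal);
        name = line.Substring(textStart, textEnd - textStart).Trim();
    }
    else if (line.StartsWith("<br />", StringComparison.OrdinalIgnoreCase))
    {
        // Everything after the tag is the field value.
        fields.Add(line.Substring("<br />".Length).Trim());
    }
    // <b> and </b> lines carry no data and are skipped.
}

// The order is the only thing that tells the fields apart, so fail loudly if it changes.
if (fields.Count != 3)
    throw new FormatException("Expected address, city/state and phone lines.");

string address = fields[0];
string[] cityState = fields[1].Split(',');
string city = cityState[0].Trim();
string state = cityState[1].Trim();
string phone = fields[2];

Console.WriteLine($"{name} | {url} | {address} | {city} | {state} | {phone}");
```

The count check only catches missing or extra lines; if two fields swap places, nothing in the data itself reveals it, which is the fragility described above.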

score 0 · Answer 4 · answered Feb 21 '13 at 17:15

You could use Regex.Replace (if this is always formatted exactly the same way) like this:

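// A regular string literal cannot span lines, so build the sample line by line.
string crappyXML =
    "<b>\n" +
    "   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>\n" +
    "</b>\n" +
    "<br />123 North Main\n" +
    "<br />Hume, ACT\n" +
    "<br />(999) 888-8888";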

string betterXML = Regex.Replace(crappyXML, @"(</b>\s*)<br />", "$1<br>");

(The \s* accounts for any whitespace, such as a line break, between </b> and <br />, and $1 puts it back.)

Then your betterXML looks like this:

"<b>
   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>
</b>
<br>123 North Main
<br />Hume, ACT
<br />(999) 888-8888";

Then you can do another Regex:

betterXML = Regex.Replace(betterXML, "<br />", "</br><br>");

Which would make it look like this:

"<b>
   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>
</b>
<br>123 North Main
</br><br>Hume, ACT
</br><br>(999) 888-8888";

Then just do this:

betterXML += "</br>";

to close the last tag.

Again, these Regex.Replace calls match the tags exactly as written, so adjust the patterns if the whitespace in the real output differs.

From there, once you wrap the result in a single root element (an XML document can have only one), you should be able to use an XML parser and loop through to get your data.
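Put together, the whole sequence might look like the sketch below. It assumes the sample format from the question with plain double quotes, uses LINQ to XML (`XElement`) as the parser, and wraps the fragment in a made-up `<root>` element. Real data containing an unescaped `&` would still fail to parse, so treat this as a starting point only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Sample response in the format shown in the question.
string crappyXML =
    "<b>\n" +
    "   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>\n" +
    "</b>\n" +
    "<br />123 North Main\n" +
    "<br />Hume, ACT\n" +
    "<br />(999) 888-8888";

// Step 1: the first <br /> (right after </b>) becomes an opening <br>.
string betterXML = Regex.Replace(crappyXML, @"(</b>\s*)<br />", "$1<br>");

// Step 2: every remaining <br /> closes the previous field and opens the next.
betterXML = Regex.Replace(betterXML, "<br />", "</br><br>");

// Step 3: close the last field.
betterXML += "</br>";

// XML needs exactly one root element, so wrap the fragment (the name "root" is arbitrary).
XElement root = XElement.Parse("<root>" + betterXML + "</root>");

// The link holds the URL and the name.
XElement link = root.Descendants("a").First();
string url = link.Attribute("href")?.Value ?? "";
string name = link.Value.Trim();

// Each <br> element now wraps one line of text: address, city/state, phone.
string[] lines = root.Elements("br").Select(br => br.Value.Trim()).ToArray();
string address = lines[0];
string[] cityState = lines[1].Split(',');
string city = cityState[0].Trim();
string state = cityState[1].Trim();
string phone = lines[2];

Console.WriteLine($"{name} | {url} | {address} | {city} | {state} | {phone}");
```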

I hope that helps! Let me know if you have any questions.
