I am trying to strip data out of a web page using a C# HTTP module. I just want the raw text and images. How can I strip everything else out?
    // Placeholder pattern; this is the part I don't know how to write.
    private static Regex reg = new Regex(@"<img src=\t????????");

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Copy only the bytes that belong to this call.
        byte[] data = new byte[count];
        Buffer.BlockCopy(buffer, offset, data, 0, count);
        // Decode the copied slice, not the whole incoming buffer.
        string html = System.Text.Encoding.Default.GetString(data);
        html = reg.Replace(html, string.Empty);
        byte[] outdata = System.Text.Encoding.Default.GetBytes(html);
        _sink.Write(outdata, 0, outdata.Length);
    }
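One possible direction, as a rough sketch and not a robust solution: instead of matching the `<img>` tags, match everything that is not an `<img>` tag and remove that. The class name `HtmlStripper` and the method `KeepTextAndImages` are made up for illustration. The sketch assumes the whole page is available as one string, whereas a filter's `Write` may be called with partial chunks, so a tag could be split across calls. Regular expressions also cannot handle every HTML edge case, such as a `>` inside a comment or an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HtmlStripper
{
    // Removes <script> and <style> elements together with their contents,
    // since their bodies are code and not page text.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any tag that does not start with "<img".
    private static readonly Regex NonImgTag = new Regex(
        @"<(?!img\b)[^>]*>",
        RegexOptions.IgnoreCase);

    public static string KeepTextAndImages(string html)
    {
        string withoutBlocks = ScriptOrStyle.Replace(html, string.Empty);
        return NonImgTag.Replace(withoutBlocks, string.Empty);
    }
}
```

Inside the filter, this would probably mean buffering the decoded chunks and calling `HtmlStripper.KeepTextAndImages` once on the complete page before writing to `_sink`. An HTML parser would be a sturdier choice if the input is not well formed.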
`, which is perfectly legal for browsers, but for it to be valid XML, would need a closing. `
`. – Anthony Pegram Oct 03 '11 at 19:23