Strip out everything outside of and random text in HTML

Question

I am trying to strip data out of a web page using a C# HTTP module. I just want the raw text and images. How can I strip everything else out?

private static Regex reg = new Regex(@"<img src=\t????????");

public override void Write(byte[] buffer, int offset, int count)
{
    // Copy only the bytes that belong to this call.
    byte[] data = new byte[count];
    Buffer.BlockCopy(buffer, offset, data, 0, count);

    // Decode the copied slice rather than the whole buffer.
    string html = System.Text.Encoding.Default.GetString(data);

    html = reg.Replace(html, string.Empty);

    byte[] outdata = System.Text.Encoding.Default.GetBytes(html);
    _sink.Write(outdata, 0, outdata.Length);
}

[obligatory](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Anthony Pegram, Oct 03 '11 at 19:07
Yeah, I know it's not an exact duplicate, but the accepted answer still answers this one: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — David, Oct 03 '11 at 19:08
That thread is spammed to high hell. Are there any other solutions, such as which XML parser to use? — tdjfdjdj, Oct 03 '11 at 19:14
We're not spamming you. Read the responses to the linked questions to see *why* you don't parse HTML/XML with regex. @Oded, I don't even know a bit of C#, and in Python, there's only an XML parser :P — Blender, Oct 03 '11 at 19:16
@user719825 This is a Q&A site, *not* a forum. Using `???` won't speed up getting a *good* answer. — Rob W, Oct 03 '11 at 19:17
@Blender - Fair enough. Most XML parsers will choke on good HTML unless it was specifically written as XML. — Oded, Oct 03 '11 at 19:17
@Oded: Really? I thought HTML was a subset of XML, so an XML parser would parse HTML also. I'd better read up on that... — Blender, Oct 03 '11 at 19:22
@Blender, HTML is often not valid XML. Consider the simple case of `<br>`, which is perfectly legal for browsers, but for it to be valid XML, it would need to be closed: `<br />`. — Anthony Pegram, Oct 03 '11 at 19:23
I'm using HtmlAgilityPack now, but I can only get it to load from an actual page. How would I take the data in a buffer (before the page renders in the browser)? — tdjfdjdj, Oct 03 '11 at 19:26
@Blender: You're probably thinking of XHTML, which is the other way around (XHTML is a subset of XML). — Merlyn Morgan-Graham, Oct 03 '11 at 19:28
@MerlynMorgan-Graham: Yep, that's what my ` ` says. *Successful question subject diversion!* — Blender, Oct 03 '11 at 19:30
Lol :) @user719825: Don't use an XML parser unless you want to verify standard conformance, and are sure all your documents (should) be valid XHTML: See the difference between SGML-based HTML and XML-based HTML here - http://en.wikipedia.org/wiki/HTML#SGML-based_versus_XML-based_HTML — Merlyn Morgan-Graham, Oct 03 '11 at 19:39

score 1 · Accepted Answer · answered Oct 03 '11 at 19:14

Use an HTML parser, such as the HtmlAgilityPack.

George Duckett

I'm using that now, but I can only get it to load from an actual page. How would I take the data in a buffer (before the page renders in the browser)? – tdjfdjdj Oct 03 '11 at 19:25
Is there a `Render` method you could override? (Can't remember if that's just for `Page`s or not). – George Duckett Oct 03 '11 at 19:27
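
On the buffer question raised in the comments: HtmlAgilityPack can parse markup that is already in memory through `HtmlDocument.LoadHtml(string)`, so no URL or file is needed. The following is only a minimal sketch under that assumption. The class name `HtmlStripper` and method name `KeepTextAndImages` are hypothetical, and treating "raw text and images" as "text nodes plus `<img>` tags, minus script and style content" is an assumption about what the asker wants.

```csharp
using System.Text;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, not part of HtmlAgilityPack itself.
public static class HtmlStripper
{
    // Returns the text nodes and <img> tags of the given markup, in document
    // order, dropping every other element, comment, script, and style block.
    public static string KeepTextAndImages(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string already in memory

        var output = new StringBuilder();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Element && node.Name == "img")
            {
                // Keep the image tag exactly as the parser reports it.
                output.Append(node.OuterHtml);
            }
            else if (node.NodeType == HtmlNodeType.Text)
            {
                // Ignore text that belongs to <script> or <style>.
                string parent = node.ParentNode.Name;
                if (parent == "script" || parent == "style")
                {
                    continue;
                }

                // Ignore whitespace-only text between tags.
                if (string.IsNullOrWhiteSpace(node.InnerText))
                {
                    continue;
                }

                // Entities are left encoded so the result is still safe HTML text.
                output.Append(node.InnerText);
            }
        }

        return output.ToString();
    }
}
```

Inside the filter's `Write` method, this could replace the regex call: decode with `System.Text.Encoding.Default.GetString(buffer, offset, count)`, pass the string to `HtmlStripper.KeepTextAndImages`, then encode and write the result to `_sink`. One caveat: ASP.NET may call `Write` several times for a single response, so a chunk can end in the middle of a tag. Accumulating the chunks and parsing once in `Flush` or `Close` is likely the safer design.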
