I am trying to strip data out of a web page using a C# HTTP module. I just want the raw text and images. How can I strip everything else out?

private static Regex reg = new Regex(@"<img src=\t????????");

public override void Write(byte[] buffer, int offset, int count)
{
    // Decode only this chunk: buffer may hold more than count bytes.
    string html = System.Text.Encoding.Default.GetString(buffer, offset, count);

    html = reg.Replace(html, string.Empty);

    byte[] outdata = System.Text.Encoding.Default.GetBytes(html);
    _sink.Write(outdata, 0, outdata.Length);
}
tdjfdjdj
  • [obligatory](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Anthony Pegram Oct 03 '11 at 19:07
  • @Blender - An HTML parser would be a better choice. – Oded Oct 03 '11 at 19:07
  • Yeah, I know it's not an exact duplicate, but the accepted answer still answers this one: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – David Oct 03 '11 at 19:08
  • That thread is spammed to high hell. Any other solutions, such as which XML parser to use??? – tdjfdjdj Oct 03 '11 at 19:14
  • We're not spamming you. Read the responses to the linked questions to see *why* you don't parse HTML/XML with regex. @Oded, I don't even know a bit of C#, and in Python, there's only an XML parser :P – Blender Oct 03 '11 at 19:16
  • @user719825 This is a Q&A site, *not* a forum. Using `???` won't speed up getting a *good* answer. – Rob W Oct 03 '11 at 19:17
  • @Blender - Fair enough. Most XML parsers will choke on good HTML unless it was specifically written as XML. – Oded Oct 03 '11 at 19:17
  • @user719825 - Comments are not answers. – Oded Oct 03 '11 at 19:18
  • @Oded: Really? I thought HTML was a subset of XML, so an XML parser would parse HTML also. I'd better read up on that... – Blender Oct 03 '11 at 19:22
  • @Blender, HTML is often not valid XML. Consider the simple case of `<br>`, which is perfectly legal for browsers but, to be valid XML, would need to be closed: `<br />`. – Anthony Pegram Oct 03 '11 at 19:23
  • I'm using HtmlAgilityPack now, but I can only get it to load from an actual page. How would I take the data in a buffer (before the page renders in the browser)? – tdjfdjdj Oct 03 '11 at 19:26
  • @Blender: You're probably thinking of XHTML, which is the other way around (XHTML is a subset of XML). – Merlyn Morgan-Graham Oct 03 '11 at 19:28
  • @MerlynMorgan-Graham: Yep, that's what my ` ` says. *Successful question subject diversion!* – Blender Oct 03 '11 at 19:30
  • Lol :) @user719825: Don't use an XML parser unless you want to verify standard conformance, and are sure all your documents (should) be valid XHTML: See the difference between SGML-based HTML and XML-based HTML here - http://en.wikipedia.org/wiki/HTML#SGML-based_versus_XML-based_HTML – Merlyn Morgan-Graham Oct 03 '11 at 19:39

1 Answer

Use an HTML parser, such as the HtmlAgilityPack.
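
A minimal sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced. `HtmlDocument.LoadHtml` parses markup from a string, so neither a URL nor a file is required. The class and method names (`HtmlTextAndImages`, `GetImageSources`, `GetText`) are made up for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class HtmlTextAndImages
{
    // Returns the src attribute of every <img> element that has one.
    public static List<string> GetImageSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a page

        var sources = new List<string>();
        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images != null) // SelectNodes returns null when nothing matches
        {
            foreach (var img in images)
                sources.Add(img.GetAttributeValue("src", string.Empty));
        }
        return sources;
    }

    // Returns the document text with markup, scripts and styles removed.
    public static string GetText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Remove script and style elements so their contents are not treated as text.
        var noise = doc.DocumentNode.SelectNodes("//script|//style");
        if (noise != null)
        {
            foreach (var node in noise)
                node.Remove();
        }

        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }
}
```

Whether plain `InnerText` is good enough depends on the pages involved; whitespace and block boundaries are not normalized here.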

George Duckett
  • I'm using that now, but I can only get it to load from an actual page. How would I take the data in a buffer (before the page renders in the browser)? – tdjfdjdj Oct 03 '11 at 19:25
  • Is there a `Render` method you could override? (Can't remember if that's just for `Page`s or not). – George Duckett Oct 03 '11 at 19:27
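
On the buffer question raised in the comments: one possible approach is to collect every chunk passed to `Write` and process the complete document only once, when the filter stream is closed. A single `Write` call may contain only part of the page, so a tag can be split across two chunks. The sketch below uses only `System.IO` and `System.Text`. `BufferedHtmlFilter` and its `transform` delegate are hypothetical names, and the code assumes the host closes the filter at the end of the response.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical write-only filter: buffers all output, then transforms it in one pass on Close.
public class BufferedHtmlFilter : Stream
{
    private readonly Stream _sink;
    private readonly Encoding _encoding;
    private readonly Func<string, string> _transform;
    private readonly MemoryStream _buffer = new MemoryStream();
    private bool _closed;

    public BufferedHtmlFilter(Stream sink, Encoding encoding, Func<string, string> transform)
    {
        _sink = sink;
        _encoding = encoding;
        _transform = transform;
    }

    // Only accumulate here; a chunk may end in the middle of a tag.
    public override void Write(byte[] buffer, int offset, int count)
    {
        _buffer.Write(buffer, offset, count);
    }

    // Nothing is forwarded until the whole document is available.
    public override void Flush() { }

    public override void Close()
    {
        if (!_closed)
        {
            _closed = true;
            string html = _encoding.GetString(_buffer.ToArray());
            byte[] outdata = _encoding.GetBytes(_transform(html));
            _sink.Write(outdata, 0, outdata.Length);
            _sink.Flush();
            _sink.Close();
        }
        base.Close();
    }

    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}
```

In an HTTP module this could presumably be installed with something like `response.Filter = new BufferedHtmlFilter(response.Filter, response.ContentEncoding, transform)`, where `transform` parses the string with an HTML parser and returns the stripped result. Holding the whole page in memory is the trade-off for never seeing a half-written tag.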