I have computer-generated text that looks like the following (I've modified the whitespace to make it easier on the eyes).

<li class="activitybit forum_post">
    <div class="avatar">
            <img src="image.php?s=64ca7b4cc0fa2850f6c763105eee901b&amp;u=37080&amp;dateline=1396817868&amp;type=thumb" alt="killathi's Avatar" />
    </div>
    <div class="content hasavatar">
        <div class="datetime">
             <span class="date">Today,&nbsp;<span class="time">07:14 PM</span></span>
        </div>
        <div class="title">
                <a href="member.php?37080-killathi&amp;s=64ca7b4cc0fa2850f6c763105eee901b">killathi</a> replied to a thread  <a href="showthread.php?1016907-doodles!-Maybe-I-won-t-have-lines-in-it-this-time!!!-MUAHAHHAHAHAAHAH&amp;s=64ca7b4cc0fa2850f6c763105eee901b">doodles! Maybe I won't have lines in it this time!!! MUAHAHHAHAHAAHAH</a> in <a href="forumdisplay.php?208-Fan-Creations&amp;s=64ca7b4cc0fa2850f6c763105eee901b">Fan Creations</a>
        </div>
        <div class="excerpt">I'll hold this one here for now I guess, not really sure where to go with it lol</div>     
        <div class="fulllink"><a href="showthread.php?1016907-doodles!-Maybe-I-won-t-have-lines-in-it-this-time!!!-MUAHAHHAHAHAAHAH&amp;s=64ca7b4cc0fa2850f6c763105eee901b&amp;p=9844450#post9844450">see more</a></div>

    </div>
    <div class="views">77 replies | 3407 view(s)</div>
</li>

I've used the regex (?:<div class=\"title\">)((?:[\s\S]*?))(?:</div>) and extracted the following in the first capturing group:

<a href="member.php?37080-killathi&amp;s=64ca7b4cc0fa2850f6c763105eee901b">killathi</a> replied to a thread  <a href="showthread.php?1016907-doodles!-Maybe-I-won-t-have-lines-in-it-this-time!!!-MUAHAHHAHAHAAHAH&amp;s=64ca7b4cc0fa2850f6c763105eee901b">doodles! Maybe I won't have lines in it this time!!! MUAHAHHAHAHAAHAH</a> in <a href="forumdisplay.php?208-Fan-Creations&amp;s=64ca7b4cc0fa2850f6c763105eee901b">Fan Creations</a>

However, I'm wondering whether it's possible to exclude everything within angle brackets using regex and, if so, how.

I know that I need to do something in ((?:[\s\S]*?)) but I'm not really sure how to do it. (It is safe to assume all text will come in this format).

Aelphaeis
  • I just love to refer people to this answer: http://stackoverflow.com/a/1732454/1049308 – John Willemse May 06 '14 at 13:48
  • You do realize that post has nothing to do with this question. – Aelphaeis May 06 '14 at 13:49
  • It has everything to do with this question. You're trying to parse HTML with regex. – John Willemse May 06 '14 at 13:50
  • No. I'm trying to MATCH some particular text and I was wondering if I could exclude a group of data existing within a different group. There is a difference between matching and parsing. – Aelphaeis May 06 '14 at 13:53
  • Can you show your expected result in this case? You want everything that is not a tag? – fnightangel May 06 '14 at 13:57
  • @fnightangel the string I'm looking to get is "killathi replied to a thread doodles! Maybe I won't have lines in it this time!!! MUAHAHHAHAHAAHAH in Fan Creations" – Aelphaeis May 06 '14 at 13:58

3 Answers

To remove everything inside angle brackets (the brackets included), just use this regex:

<[^>]*>

like so:

string output = Regex.Replace(input, "<[^>]*>", "");

Here are the docs for Regex.Replace.
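
Putting the two steps together, here is a minimal sketch that first isolates the title div with a simplified form of the regex from the question and then strips the tags. The names `TitleExtractor` and `ExtractTitle` are made up for illustration, and the entity decoding and whitespace collapsing are extras the question did not ask for. It also assumes the title div contains no nested div, which holds for the sample markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Hypothetical helper: returns the visible text of the first <div class="title">,
    // or null when the input contains no such div.
    public static string ExtractTitle(string html)
    {
        // Step 1: capture the div's contents (same idea as the regex in the question).
        Match m = Regex.Match(html, "<div class=\"title\">([\\s\\S]*?)</div>");
        if (!m.Success)
        {
            return null;
        }

        // Step 2: delete every tag, i.e. '<', any run of non-'>' characters, then '>'.
        string text = Regex.Replace(m.Groups[1].Value, "<[^>]*>", "");

        // Optional extras: decode entities such as &amp; and collapse runs of whitespace.
        text = WebUtility.HtmlDecode(text);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Calling `TitleExtractor.ExtractTitle(input)` on the markup from the question should give the single line of text the asker describes in the comments.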

Mike H-R
  • Most of the way there but you'd probably want that to be non-greedy. See my answer. – Steve Pettifer May 06 '14 at 14:08
  • @StevePettifer non greedy isn't needed, I search for anything other than a closing bracket 0 or more times followed by a closing bracket, try it out if you're curious. – Mike H-R May 06 '14 at 14:10
  • Yes, good point, I managed to mis-read that somehow which is amazing when you consider how short it is! – Steve Pettifer May 06 '14 at 14:14
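
The greedy versus non-greedy point in these comments can be checked with a small sketch. The class `TagPatterns` and its member names are hypothetical; only `Regex.Replace` comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatterns
{
    // Negated class: [^>]* cannot run past the first '>', so each tag matches on its own.
    public const string NegatedClass = "<[^>]*>";

    // Lazy dot: also stops at the first '>' it can reach.
    public const string LazyDot = "<.+?>";

    // Greedy dot: runs from the first '<' to the last '>' on the line.
    public const string GreedyDot = "<.+>";

    // Removes every match of the given pattern from the input.
    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "");
    }
}
```

For the input `<a href="x">killathi</a> replied`, the first two patterns both leave `killathi replied`, while the greedy pattern swallows the link text as well and leaves only ` replied`. That is why the negated class needs no lazy quantifier.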
I would suggest using the HTML Agility Pack library.

You can extract your text as simply as this:

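// Requires the HtmlAgilityPack NuGet package and: using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(yourHtml);

// Select the first <div class="title"> anywhere in the document.
var node = doc.DocumentNode.SelectSingleNode("//div[@class='title']");
// SelectSingleNode returns null when nothing matches; InnerText drops the tags.
string result = node?.InnerText.Trim();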
wp78de
Andriy Horen
I'm thinking a Regex.Replace might do it, but it is notoriously hard to manipulate HTML with regexes in the general case. Here is a fiddle which demonstrates the use of (<.+?>). It works on your example, but I make no guarantees!
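
One concrete case where (<.+?>) can fall short is a tag that spans more than one line: in .NET, `.` does not match a newline unless `RegexOptions.Singleline` is set. The sketch below illustrates this under that assumption; `LazyDotDemo` and its methods are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDotDemo
{
    // Default options: '.' stops at '\n', so a tag broken across lines is left in place.
    public static string StripLazy(string input)
    {
        return Regex.Replace(input, "<.+?>", "");
    }

    // Singleline lets '.' match '\n' too, so multi-line tags are removed as well.
    public static string StripLazySingleline(string input)
    {
        return Regex.Replace(input, "<.+?>", "", RegexOptions.Singleline);
    }
}
```

Given `"<a\nhref=\"x\">text</a>"`, `StripLazy` removes only the closing tag and leaves the opening one behind, while `StripLazySingleline` returns `text`. The sample markup in the question keeps every tag on one line, which is why the pattern works there.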

Steve Pettifer