I am currently writing a script to parse bits of content out of an HTML document.

Here is an example of the code I am parsing:

<div class="tab-content">
<div class="tab-pane fade in active" id="how-to-take">
<div class="panel-body">
<h3>What is Pantoprazole?</h3>
Pantoprazole is a generic drug used to treat certain conditions where there is too much acid in the stomach. It is
used to treat gastric and duodenal ulcers, erosive esophagitis, and gastroesophageal reflux disease (GERD). GERD is
a condition where the acid in the stomach washes back up into the esophagus. <br/> Pantoprazole is a proton pump
inhibitor (PPI). It works by decreasing the amount of acid produced by the stomach.
<h3>How To Take</h3>
Take the tablets 1 hour before a meal without chewing or breaking them and swallow them whole with some water
</div>
</div>
<div class="tab-pane fade" id="alternative-treatments">
<div class="panel-body">
<h3>Alternatives</h3>
Antacids taken as required Antacids are alkali liquids or tablets
that can neutralise the stomach acid. A dose may give quick relief.
There are many brands which you can buy. You can also get some on
prescription. If you have mild or infrequent bouts of dyspepsia you
may find that antacids used as required are all that you need.<br/>
</div>
</div>
<div class="tab-pane fade" id="side-effects">
<div class="panel-body">
<p>Most people who take acid reflux medication do not have any side-effects.
However, side-effects occur in a small number of users. The most
common side-effects are:</p>
<ul>

I am trying to parse all the content between:

<div class="tab-pane fade in active" id="how-to-take">
<div class="panel-body">

and

</div>

I have written the following regex code:

<div class="tab-pane fade in active" id="how-to-take">\n<div class="panel-body">\n(.*?[\s\S]+)\n(?:<\/div>)

and also have tried:

<div class="tab-pane fade in active" id="how-to-take">\n<div class="panel-body">\n(.*?[\s\S]+)\n<\/div>

But it doesn't stop at the first </div>; it continues until the final </div> in the code.

user1838222
  • [Don't use regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). You could use `HtmlAgilityPack`. – Tim Schmelter May 18 '15 at 09:01
  • Yeah, this software is purely internal; I just wanted to get it finished quickly :). It will not be used after I have completed this :) – user1838222 May 18 '15 at 09:03
  • [How to use HTML Agility pack](http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack). This is the regex you are looking for, but you must use a parser. `(?s)<div class="tab-pane fade in active" id="how-to-take">\s*<div class="panel-body">\s*((?:(?!</div>).)*?)\s*</div>` – Wiktor Stribiżew May 18 '15 at 09:04

2 Answers

Don't use regex to parse HTML. You could use HtmlAgilityPack.

Then this works as desired:

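// Requires the HtmlAgilityPack NuGet package; File needs "using System.IO;".
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(File.ReadAllText("Path"));
// Selects the first panel-body div in the document, which here is the one inside #how-to-take.
var divPanelBody = doc.DocumentNode.SelectSingleNode("//div[@class='panel-body']");
// SelectSingleNode returns null when nothing matches, so text is null in that case.
string text = divPanelBody?.InnerText.Trim();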

Result:

What is Pantoprazole? Pantoprazole is a generic drug used to treat certain conditions where there is too much acid in the stomach. It is used to treat gastric and duodenal ulcers, erosive esophagitis, and gastroesophageal reflux disease (GERD). GERD is a condition where the acid in the stomach washes back up into the esophagus. Pantoprazole is a proton pump inhibitor (PPI). It works by decreasing the amount of acid produced by the stomach. How To Take Take the tablets 1 hour before a meal without chewing or breaking them and swallow them whole with some water

Here's another approach using LINQ, which I prefer over the XPath syntax:

var divPanelBody = doc.DocumentNode.Descendants("div")
    .FirstOrDefault(d => d.GetAttributeValue("class", "") == "panel-body");

Note that both approaches are case-sensitive, so they won't find Panel-Body. You could easily make the last approach case-insensitive:

var divPanelBody = doc.DocumentNode.Descendants("div")
    .FirstOrDefault(d => d.GetAttributeValue("class", "").Equals("panel-body", StringComparison.InvariantCultureIgnoreCase));
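As for why the pattern in the question runs on to the last </div>: `[\s\S]+` is a greedy quantifier, so it consumes the rest of the input and then backtracks only as far as the final `\n</div>`. The sketch below is an illustration only, not a recommendation to parse HTML this way. It uses a shortened copy of the markup from the question, and the names `GreedyVsLazyDemo`, `Prefix` and `Capture` are made up for the example. It compares the original capture group with a lazy `[\s\S]+?` variant.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Shortened copy of the markup from the question (two tabs only).
    private const string Html =
        "<div class=\"tab-pane fade in active\" id=\"how-to-take\">\n" +
        "<div class=\"panel-body\">\n" +
        "<h3>How To Take</h3>\n" +
        "Take the tablets 1 hour before a meal\n" +
        "</div>\n" +
        "</div>\n" +
        "<div class=\"tab-pane fade\" id=\"alternative-treatments\">\n" +
        "<div class=\"panel-body\">\n" +
        "<h3>Alternatives</h3>\n" +
        "</div>\n" +
        "</div>";

    // Opening part of the pattern, shared by both variants ("\\n" is the regex escape \n).
    public const string Prefix =
        "<div class=\"tab-pane fade in active\" id=\"how-to-take\">\\n<div class=\"panel-body\">\\n";

    // Returns capture group 1 of the first match, or null if the pattern does not match.
    public static string Capture(string pattern)
    {
        Match m = Regex.Match(Html, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Greedy: [\s\S]+ runs to the end of the input, then backtracks to the LAST "\n</div>".
        string greedy = Capture(Prefix + @"(.*?[\s\S]+)\n<\/div>");
        // Lazy: [\s\S]+? expands one character at a time and stops at the FIRST "\n</div>".
        string lazy = Capture(Prefix + @"([\s\S]+?)\n<\/div>");

        Console.WriteLine(greedy.Contains("Alternatives")); // True: the capture spilled into the next tab
        Console.WriteLine(lazy);                            // only the how-to-take content
    }
}
```

Even the lazy version would break as soon as the panel body contains a nested div, which is one more reason to prefer the parser.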

Tim Schmelter

You can do this easily with HtmlAgilityPack:

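// Requires: using System.Text; using HtmlAgilityPack;
public string GetInnerHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var nodes = doc.DocumentNode.SelectNodes("//div[@class=\"panel-body\"]");
    if (nodes == null)
    {
        return string.Empty;
    }
    // Concatenate the inner HTML of every panel-body div, not only the first one.
    StringBuilder sb = new StringBuilder();
    foreach (var n in nodes)
    {
        sb.Append(n.InnerHtml);
    }
    return sb.ToString();
}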
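
Since the question asks only for the content of the how-to-take tab, the XPath can be narrowed by id so that the other panel-body divs are skipped. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (`HowToTakeExtractor`, `GetPanelBodyHtml`) and the sample markup are hypothetical, and `tabId` is assumed to contain no quote characters because it is concatenated straight into the XPath expression.

```csharp
using System;
using HtmlAgilityPack;

public static class HowToTakeExtractor
{
    // Returns the inner HTML of the panel-body div that is a direct child of the
    // div with the given id, or null when no such element exists.
    // Assumption: tabId contains no quote characters (it is inserted into the XPath as-is).
    public static string GetPanelBodyHtml(string html, string tabId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//div[@id='" + tabId + "']/div[@class='panel-body']");

        // SelectSingleNode returns null when nothing matches.
        return node?.InnerHtml.Trim();
    }

    public static void Main()
    {
        // Hypothetical, shortened markup in the same shape as the question's document.
        string html =
            "<div class=\"tab-pane fade in active\" id=\"how-to-take\">" +
            "<div class=\"panel-body\"><h3>How To Take</h3>Take the tablets</div></div>" +
            "<div class=\"tab-pane fade\" id=\"side-effects\">" +
            "<div class=\"panel-body\"><p>Other content</p></div></div>";

        Console.WriteLine(GetPanelBodyHtml(html, "how-to-take"));
    }
}
```

Swapping `InnerHtml` for `InnerText` would return the text without the markup, as in the other answer.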