How can I parse this HTML to get the content I want?

Question

I am currently trying to parse an HTML document to retrieve all of the footnotes inside of it; the document contains dozens and dozens of them. I can't really figure out the expressions to use to extract all of the content I want. The thing is, the classes (e.g. "calibre34") are randomized in every document. The only way to locate the footnotes is to search for "hide"; the footnote text always follows it and ends with a closing </td> tag. Below is an example of one of the footnotes in the HTML document; all I want is the text. Any ideas? Thanks guys!

<td class="calibre33">1.<span><a class="x-xref" href="javascript:void(0);">
[hide]</a></span></td>
<td class="calibre34">
Among the other factors on which the premium would be based are the
average size of the losses experienced, a margin for contingencies,
a loading to cover the insurer's expenses, a margin for profit or
addition to the insurer's surplus, and perhaps the investment
earnings the insurer could realize from the time the premiums are
collected until the losses must be paid.</td>

Parsing with what? [I hope you don't mean Regex...](http://stackoverflow.com/a/1732454/334053) Tag your post with the language you're using to parse the HTML otherwise nobody will be able to help you. — qJake, Jun 28 '12 at 18:25
Could you look for the `a` tags with the `x-xref` class and grab the closest `td` parent? — Peter Olson, Jun 28 '12 at 18:25
Use either an [`XDocument`](http://msdn.microsoft.com/en-us/library/system.xml.linq.xdocument.aspx) (LINQ to XML) or an [`XmlDocument`](http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx) (DOM) to parse your HTML. Both of these XML libraries are already contained within .NET/C# and are very robust. — qJake, Jun 28 '12 at 18:28
What about the number of td elements? Are they the same? I mean, is the footnote always in the same td element? If you want to parse in Java, you can use the Jericho HTML Parser. — Pradeep, Jun 28 '12 at 18:28
All footnotes are in td tags, but so are many other things. These HTML documents are gigantic with a lot of content and tags in them, they were very poorly written and it's my job to get the footnotes out and I just don't feel like sitting there and copy+pasting them for 30 years. Also, thanks SpikeX, I'll take a look at it. — JMarsh, Jun 28 '12 at 18:31
One more trick: if everything comes after [hide], you can look for content whose length exceeds some threshold (assuming footnotes are longer than, e.g., 50 characters). When you measure the length, make sure no '<' or '>' or any other HTML tag appears inside. — Pradeep, Jun 28 '12 at 18:34

Marcel N. · Accepted Answer · 2012-06-28T19:19:07.610

4

Use HtmlAgilityPack to load the HTML document and then extract the footnotes with this XPath:

//td[text()='[hide]']/following-sibling::td

Basically, it first selects all td nodes whose text is [hide] and then selects the td siblings that follow each of them (append [1] to the expression if you only want the next td). Once you have this collection of nodes you can extract their inner text (in C#, with the support provided in HtmlAgilityPack).
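As a rough illustration of that approach, here is a minimal sketch using HtmlAgilityPack. It is not the answerer's code: the class name `FootnoteExtractor` and the method `Extract` are made up for this example. It also assumes a slightly different XPath than the one above, because in the sample from the question the `[hide]` text sits inside a nested `a` element rather than directly in the `td`, so an exact `text()='[hide]'` test may not match there. `contains(., '[hide]')` checks the whole text of the cell instead, and `[1]` keeps only the next `td`. If the tables in your documents are nested, an outer cell could match as well, so treat this as a starting point.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, named for this example only.
public static class FootnoteExtractor
{
    // Assumed variant of the XPath above: any td whose full text contains
    // "[hide]", followed by only the first td sibling after it.
    private const string FootnoteXPath =
        "//td[contains(., '[hide]')]/following-sibling::td[1]";

    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(FootnoteXPath);
        if (cells == null)
        {
            return results;
        }

        foreach (HtmlNode cell in cells)
        {
            // Decode entities such as &amp; and collapse the line breaks
            // and indentation left over from the markup.
            string text = HtmlEntity.DeEntitize(cell.InnerText);
            results.Add(Regex.Replace(text, @"\s+", " ").Trim());
        }

        return results;
    }
}
```

To use it, read one of the documents into a string (for example with `File.ReadAllText`) and pass it to `FootnoteExtractor.Extract`; each returned string should be the text of one footnote cell.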

edited Jun 28 '12 at 19:19

answered Jun 28 '12 at 19:13

Thanks buddy, I'll try this out and let you know how it works. – JMarsh Jun 28 '12 at 19:19
@OriginJM: Don't mention it. It should work fine. If not, let me know and I'll try to adjust it. The basic idea is correct. – Marcel N. Jun 28 '12 at 19:20
Hey buddy, I finally got around to testing this out and it worked great. Thanks, you're the man! – JMarsh Jul 06 '12 at 16:12

Chachi · Answer 2 · 2012-06-29T15:13:41.617

How about using MSHTML to parse the HTML source? Here is some demo code. Enjoy.

using System;
using System.Collections.Generic;
using System.Net;
using mshtml; // COM reference: Microsoft HTML Object Library

public class CHtmlParseDemo
{
    private string strHtmlSource;
    public IHTMLDocument2 oHtmlDoc;

    public CHtmlParseDemo(string url)
    {
        GetWebContent(url);
        oHtmlDoc = (IHTMLDocument2)new HTMLDocument();
        oHtmlDoc.write(strHtmlSource);
        oHtmlDoc.close(); // finish the write so the document is fully parsed
    }

    // Returns the inner HTML of every td whose class equals tdClassName.
    public List<string> GetTdNodes(string tdClassName)
    {
        List<string> listOut = new List<string>();
        IHTMLElement2 ie = (IHTMLElement2)oHtmlDoc.body;
        IHTMLElementCollection iec = ie.getElementsByTagName("td");
        foreach (IHTMLElement item in iec)
        {
            if (item.className == tdClassName)
            {
                listOut.Add(item.innerHTML);
            }
        }
        return listOut;
    }

    // Downloads the page source into strHtmlSource.
    void GetWebContent(string strUrl)
    {
        using (WebClient wc = new WebClient())
        {
            strHtmlSource = wc.DownloadString(strUrl);
        }
    }
}

class Program
{
    static void Main(string[] args)
    {
        CHtmlParseDemo oH = new CHtmlParseDemo("http://stackoverflow.com/faq");

        Console.WriteLine(oH.oHtmlDoc.title);
        List<string> l = oH.GetTdNodes("x");
        foreach (string n in l)
        {
            Console.WriteLine("new td");
            Console.WriteLine(n);
        }

        Console.Read();
    }
}

I have found mshtml to be horrible. Any self closing tag such as a
will absolutely destroy your parsing attempts. I am currently looking for a new method for parsing — JSON, Jul 27 '16 at 14:01
