I receive a very irregular HTML file.

<tr class="" rel="30887721">
    <td class="leftborder timestamp" rel="1472298782"> 
        <span class="updatets "> 9mins </span> 
    </td> 
    <td> 
        <span>
            <style>
                .NFK2{display:none}
                .gPwA{display:inline}
                .Zb70{display:none}
                .vFY2{display:inline}
            </style>
            <span style="display:none">54</span>
            <span class="NFK2">54</span>
            <div style="display:none">54</div>
            <span class="vFY2">124</span>
            <span style="display: inline">.</span>
            <span class="7">240</span>
            <span class="235">.</span>
            <div style="display:none">17</div>
            <span class="NFK2">62</span>
            <span></span>
            <span style="display:none">121</span>
            <span></span>
            <span style="display: inline">187</span>
            <span style="display:none">190</span>
            <span class="Zb70">190</span>
            <span class="NFK2">197</span>
            <span></span>
            <span style="display: inline">.</span>
            <span class="248">80</span>
            <div style="display:none">152</div>
            <span style="display:none">166</span>
            <div style="display:none">166</div>
        </span>
    </td> 
    <td> 80 </td> 
    <td style="text-align:left" class="country" rel="cn"> 
        <span style="white-space:nowrap;"> 
            <img src="/images/1x1.png" style="width: 16px; height: 11px; margin-right: 5px;" class="flags-cn" alt="flag "/> 
            China 
        </span> 
    </td> 
    <td> 
        <div class="progress-indicator response_time" style="width: 114px" value="1314" levels="speed" rel="1314"> 
            <div class="indicator" style="width: 87%; background-color: rgb(0, 173, 173)"></div> 
        </div> 
    </td> 
    <td> 
        <div class="progress-indicator connection_time" style="width: 114px" title="" rel="427" value="427" levels="speed"> 
            <div class="indicator" style="width: 91%; background-color: rgb(0, 173, 173)"></div> 
        </div> 
    </td> 
    <td> HTTP </td> 
    <td nowrap> High +KA </td>
</tr>
<tr class="altshade" rel="30887719"> 
    <td class="leftborder timestamp" rel="1472298723"> 
        <span class="updatets "> 10mins </span> 
    </td> 
    <td> 
        <span> 
            <style>
                .ZQOg{display:none}
                .hAKN{display:inline}
                .sZYH{display:none}
                .euLE{display:inline}
                .pnDV{display:none}
                .yf2r{display:inline}
            </style>
            <span style="display:none">30</span>
            <div style="display:none">30</div>
            <span class="yf2r">124</span>
            <span style="display: inline">.</span>
            <span style="display:none">62</span>
            <span style="display: inline">244</span>
            <span style="display: inline">.</span>
            <span class="pnDV">6</span>
            <div style="display:none">6</div>
            <span class="ZQOg">39</span>
            <div style="display:none">39</div>
            <span style="display:none">71</span>
            <div style="display:none">71</div>
            <span style="display:none">103</span>
            <span class="sZYH">103</span>
            <span></span>
            <span class="euLE">157</span>
            <span style="display:none">188</span>
            <div style="display:none">188</div>
            <div style="display:none">208</div>
            <span style="display:none">220</span>
            <div style="display:none">220</div>
            <span class="sZYH">231</span>
            <span style="display:none">241</span>
            <span class="hAKN">.</span>
            <span class="sZYH">26</span>
            <span></span>
            <span class="sZYH">31</span>
            <span></span>
            <span style="display:none">66</span>
            <div style="display:none">66</div>
            <span style="display:none">84</span>
            <span class="pnDV">84</span>
            <span></span>
            <span style="display:none">166</span>
            <span class="sZYH">166</span>
            <div style="display:none">166</div>
            <span style="display:none">207</span>
            <span></span>
            <span style="display: inline">209</span>
            <span class="sZYH">212</span>
            <div style="display:none">212</div>
            <span style="display:none">241</span>
            <span class="pnDV">241</span> 
        </span> 
    </td> 
    <td> 80 </td> 
    <td style="text-align:left" class="country" rel="hk"> 
        <span style="white-space:nowrap;"> 
            <img src="/images/1x1.png" style="width: 16px; height: 11px; margin-right: 5px;" class="flags-hk" alt="flag "/> 
            Hong Kong 
        </span> 
    </td> 
    <td> 
        <div class="progress-indicator response_time" style="width: 114px" value="1165" levels="speed" rel="1165"> 
            <div class="indicator" style="width: 88%; background-color: rgb(0, 173, 173)"></div> 
        </div> 
    </td> 
    <td> 
        <div class="progress-indicator connection_time" style="width: 114px" title="" rel="287" value="287" levels="speed"> 
            <div class="indicator" style="width: 94%; background-color: rgb(0, 173, 173)"></div> 
        </div> 
    </td> 
    <td> HTTP </td> 
    <td nowrap> High +KA </td>
</tr>

I need to extract the text inside each TD of this file. The result should look like this:

9mins    124.240.187.80    80    China        HTTP    High +KA
10mins   124.244.157.209   80    Hong Kong    HTTP    High +KA

I'm running into several problems getting this result.
The first is the invalid markup, such as a span inside a span, a style inside a span, etc.
The second is that it needs some live parsing to evaluate the <style> tags in it.

The style tags and style attributes determine which elements should be displayed and which should not.

I'm using C# + CsQuery to extract these results but, so far, without success.

CQ dom = CQ.Create(text);
CQ tr = dom.Select("table tr");
foreach(var item in tr)
{
    string lastCheck = tr.Select("td:eq(0)").Text(); //9mins
    string ip = tr.Select("td:eq(1)").Text();
    string port = tr.Select("td:eq(2)").Text(); //80
    string country = tr.Select("td:eq(3)").Text(); //China
    string protocol = tr.Select("td:eq(6)").Text(); //HTTP
    string anonymity = tr.Select("td:eq(7)").Text(); //High + KA
}

The ip variable returns something like:

".Yj0s{display:none}\n.YSE7{display:inline}\n.zURn{display:none}\n.odWZ{display:inline}637891919292106106137183183183188245245254.85135.166.117177214214225"

If I change the ip variable to get the HTML instead:

string ip = tr.Select("td:eq(1)").Html();

it returns something like this:

" <span> <style>.PLBz{display:none}\n.hjVo{display:inline}</style><span class=\"PLBz\">92</span><div style=\"display:none\">92</div><span style=\"display:none\">114</span><span class=\"PLBz\">114</span><div style=\"display:none\">114</div><span class=\"hjVo\">122</span><span class=\"PLBz\">240</span><div style=\"display:none\">240</div>.96<span style=\"display:none\">175</span><span class=\"PLBz\">175</span><div style=\"display:none\">191</div><span style=\"display:none\">229</span><span class=\"PLBz\">229</span><div style=\"display:none\">229</div><span style=\"display:none\">241</span><span></span><span class=\"80\">.</span><div style=\"display:none\">22</div><span style=\"display:none\">38</span><div style=\"display:none\">38</div><span class=\"hjVo\">59</span><span class=\"PLBz\">156</span><div style=\"display:none\">156</div>.<span style=\"display:none\">18</span><span class=\"PLBz\">18</span><div style=\"display:none\">18</div><span class=\"PLBz\">45</span><div style=\"display:none\">45</div>104<span class=\"PLBz\">145</span><span></span><span style=\"display:none\">150</span><span class=\"PLBz\">150</span><div style=\"display:none\">150</div><span style=\"display:none\">178</span><div style=\"display:none\">178</div><span></span><span class=\"PLBz\">252</span><div style=\"display:none\">252</div> </span> "

How can I get IP showing the correct value?

Mr Lister
Andy Schmitt

1 Answer

I think there are a few things you need to do here:

  • Remove from the DOM any elements with style="display:none". This can be done fairly easily in CsQuery:

    dom.Select("*:hidden").Remove();
    
  • Parse the contents of the <style> elements and remove elements that are not displayed because of a declaration within the <style> element. Instead of using regular expressions to do this, let's do things properly. Let's use ExCSS to parse the CSS. Here's a method that takes a CsQuery selector, uses ExCSS to parse all <style> elements and removes all elements that the style sets display: none on:

        void RemoveElementsHiddenByStyles(CQ selector)
        {
            var parser = new Parser();
            foreach (IDomElement style in selector.Select("style"))
            {
                StyleSheet stylesheet = parser.Parse(style.InnerText);
                foreach (StyleRule styleRule in stylesheet.StyleRules)
                {
                    if (styleRule.Declarations.Any(d => d.Name == "display" && d.Term.ToString() == "none"))
                    {
                        selector.Select(styleRule.Selector.ToString()).Remove();
                    }
                }
            }
        }
    

    Once the contents of each <style> element have been parsed, the element itself can be removed.

    Take care to do this on a row-by-row basis, in case a style declaration in one row conflicts with one within another row.

  • Remove all whitespace from the resulting element text. I'll leave it up to you to write a suitable RemoveAllWhitespace method. This answer may help.
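
For the last step, a minimal sketch of a RemoveAllWhitespace helper might look like the following. The class name TextHelpers is hypothetical, and the sketch uses only the .NET base class library (nothing from CsQuery or ExCSS). It drops every character for which char.IsWhiteSpace returns true, so the spaces and newlines left between the surviving elements disappear.

```csharp
using System;
using System.Linq;

// Hypothetical helper class; the name is an assumption, not part of CsQuery or ExCSS.
public static class TextHelpers
{
    // Returns the input with every whitespace character (spaces, tabs, newlines) removed.
    public static string RemoveAllWhitespace(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }
        return new string(input.Where(c => !char.IsWhiteSpace(c)).ToArray());
    }
}
```

If the method is copied into the same class as the extraction loop, it can be called unqualified as RemoveAllWhitespace(...); otherwise it would be called as TextHelpers.RemoveAllWhitespace(...).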

Putting it all together, we have the following:

        CQ dom = CQ.Create(text);
        dom.Select("*:hidden").Remove();
        CQ rows = dom.Select("table tr");
        foreach (var item in rows)
        {
            CQ row = CQ.Create(item);
            RemoveElementsHiddenByStyles(row);
            row.Select("style").Remove();
            string lastCheck = row.Select("td:eq(0)").Text().Trim(); //9mins
            string ip = RemoveAllWhitespace(row.Select("td:eq(1)").Text()); //124.240.187.80
            string port = row.Select("td:eq(2)").Text().Trim(); //80
            string country = row.Select("td:eq(3)").Text().Trim(); //China
            string protocol = row.Select("td:eq(6)").Text().Trim(); //HTTP
            string anonymity = row.Select("td:eq(7)").Text().Trim(); //High +KA
        }

Note also that I've avoided the use of tr as a variable name: in your code tr contained the list of all rows, but in the body of your loop it looked as if you were using it for an individual row.
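
To produce the aligned output shown in the question, the extracted fields could be padded into fixed-width columns. This is only a sketch: the RowFormatter class and the column widths are assumptions chosen to fit the two sample rows, not something the question specifies.

```csharp
using System;

// Hypothetical formatter; the class name and column widths are assumptions for illustration.
public static class RowFormatter
{
    // Left-aligns each field in a fixed-width column; the last field is not padded.
    public static string FormatRow(string lastCheck, string ip, string port,
                                   string country, string protocol, string anonymity)
    {
        return string.Format("{0,-9}{1,-18}{2,-6}{3,-13}{4,-8}{5}",
            lastCheck, ip, port, country, protocol, anonymity);
    }
}
```

Inside the loop, Console.WriteLine(RowFormatter.FormatRow(lastCheck, ip, port, country, protocol, anonymity)) would then print one line per row. Values longer than a column width are not truncated, so the widths may need adjusting for other data.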

Luke Woodward
  • Wow, this was a mind-blowing answer! I had solved my problem by using a lot of regexes to do my own parsing, but the code was long and painful. I didn't know about ExCSS. Thanks a lot for this answer, I really appreciate it. – Andy Schmitt Aug 28 '16 at 18:08