Extracting text from an innerHTML string according to different rules

Question

I have a string containing the inner HTML of a DIV (with contentEditable) coming from the client side.
Inside this DIV there can be SPAN, P and TABLE elements, and so on: anything you can have.

I'd like to extract from this string only the text inside the P elements, sometimes the text inside the TD cells of the tables (sometimes a P is inside the TD, sometimes not), and the text that sits inside a DIV element.

The string can look like this:

string text = @"
<P>
    tset if it work 
    <SPAN onresizestart=""return false"" ondrag=javascript:dragActif(); contentEditable=false style=""BACKGROUND-COLOR: #c0d4e6"" edth_type=""var"" edth_var_pob=""n"" edth_var_pgm=""RBLZVALO"" edth_var_def=""B"" edth_var_casse=""car"" edth_var_lg=""050"" edth_var_type=""c"" edth_var_nom=""Adr_Leg_Lig1"" edth_var_lib=""Ligne 1 adresse légale"" edth_var_libc=""Adr_Leg_Lig1"" edth_var_num=""1"" edth_var_posFich=""0"">
       Adr_Leg_Lig1
    </SPAN> 
    test
</P>
<P>This <FONT size=2 edth_sizeUTIA4=""8"">should</FONT> Work
</P>";

I tried to parse it as XML with

XmlDocument xd = new XmlDocument();
xd.LoadXml(text);

but it failed. I also tried to parse it as HTML with this ParseHTML, but that failed too.

I tried filtering every possibility with regex, but sometimes we have to take what is inside <FONT>, as in this example, and sometimes we don't.

Is there a way to convert this to HTML on the server side with ASP.NET, or to convert it to some sort of XML, so that I can work with its tags and their attributes?

EDIT: My configuration is ASP.NET 2.0 and IE5, so no jQuery (well, IE5), and I can't use external libraries.

So in your case, what exact text do you want to end up with? – mybirthname, Jan 05 '16 at 11:11
@mybirthname In the end I only want to get the text inside the string and get rid of all the tags. – Slayner, Jan 05 '16 at 11:30

MyDaftQuestions · Answer 1 · 2016-01-05T11:21:19.490

1

It is unlikely to work in general, but with a few minor changes to the string you provided, you could use

string text = "<ME><P>tset if it work <SPAN onresizestart='return false' ondrag='javascript:dragActif();' contentEditable='false' style='BACKGROUND-COLOR: #c0d4e6' edth_type='var' edth_var_pob='n' edth_var_pgm='RBLZVALO' edth_var_def='B' edth_var_casse='car' edth_var_lg='050' edth_var_type='c' edth_var_nom='Adr_Leg_Lig1' edth_var_lib='Ligne 1 adresse légale' edth_var_libc='Adr_Leg_Lig1' edth_var_num='1' edth_var_posFich='0'>Adr_Leg_Lig1</SPAN> test</P><P>This <FONT size='2' edth_sizeUTIA4='8'>should</FONT> Work</P></ME>";

XDocument xd = XDocument.Parse(text);

I had to wrap it in a dummy tag (<ME>); otherwise the document would have multiple root elements, which XML does not allow.

I also had to make sure that every attribute value after an = is wrapped in quotes. For example, your original had:

This <FONT size=2

Where I used

This <FONT size='2'

Screenshot using the XML visualizer
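
The same idea can be sketched with XmlDocument, which may matter for the asker because XDocument lives in System.Xml.Linq (.NET 3.5 and later). This is only a sketch under a strong assumption: the fragment is already well-formed apart from the missing root element (quoted attributes, every tag closed). The class name, the helper name and the shortened sample fragment are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class WrappedFragmentSketch
{
    // Hypothetical helper: returns the text of each P element, ignoring SPAN content.
    // Assumes the fragment is well-formed XML apart from having several root elements.
    public static List<string> ExtractParagraphText(string fragment)
    {
        XmlDocument xd = new XmlDocument();
        // Wrap the fragment in a dummy root so the document has exactly one root element.
        xd.LoadXml("<ME>" + fragment + "</ME>");

        // Copy the matches first, so no node is removed while the XPath result is enumerated.
        // XML names are case-sensitive: "//SPAN" only matches upper-case SPAN elements.
        List<XmlNode> spans = new List<XmlNode>();
        foreach (XmlNode span in xd.SelectNodes("//SPAN"))
        {
            spans.Add(span);
        }
        foreach (XmlNode span in spans)
        {
            span.ParentNode.RemoveChild(span);
        }

        // InnerText concatenates the text of all remaining descendants (FONT included).
        List<string> lines = new List<string>();
        foreach (XmlNode p in xd.SelectNodes("//P"))
        {
            lines.Add(p.InnerText.Trim());
        }
        return lines;
    }

    public static void Main()
    {
        // Shortened, already well-formed version of the fragment (an assumption).
        string fragment =
            "<P>tset if it work <SPAN edth_type='var'>Adr_Leg_Lig1</SPAN> test</P>" +
            "<P>This <FONT size='2'>should</FONT> Work</P>";

        foreach (string line in ExtractParagraphText(fragment))
        {
            Console.WriteLine(line);
        }
    }
}
```

LoadXml throws an XmlException as soon as the input has an unquoted attribute or an unclosed tag, so this only helps when the HTML can be cleaned up first.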

edited Jan 05 '16 at 11:21

answered Jan 05 '16 at 11:14

MyDaftQuestions


I think you can assume that the source of the HTML will *not* normally be hard-coded strings (i.e. "coming from client side"), so it is likely out of their control to make web pages into valid XML! :) – iCollect.it Ltd Jan 05 '16 at 11:19
True. And as soon as the HTML forgets to close a tag, it's pretty much bust, e.g. `hello world !` – MyDaftQuestions Jan 05 '16 at 11:24
You would be lucky to find a single website anywhere that was XML compliant :) Browsers can cope with a lot of HTML irregularities – iCollect.it Ltd Jan 05 '16 at 11:25
@MyDaftQuestions I can add a root element, but the big issue is that in some cases the tags aren't closed. I could perhaps work around this with some regex after retrieving the text from all the nodes in the XML, I guess. – Slayner Jan 05 '16 at 11:32
@MyDaftQuestions, after trying a few models and parsing them as XML, it seems there are too many irregularities in our HTML for the XML parser to cope with, so we can't use this for our purpose. It would have been so much easier if we could have. – Slayner Jan 07 '16 at 09:55
You could always improve the HTML if you own it :) – MyDaftQuestions Jan 07 '16 at 10:44

Alex · Accepted Answer · 2016-01-05T12:01:06.830

1

Parsing HTML is generally difficult and there are many edge cases to think of, so I would recommend using an external library like HTMLAgilityPack. If your client does not allow external libraries, you can download its source code and include the relevant projects in your solution.

Using HTMLAgilityPack and the code snippet below I get the following output:

tset if it work  test
This should Work

You may need to filter for additional elements and to tweak the XPath expression to be more specific.

using System;
using System.Linq;
using HtmlAgilityPack;

namespace MongoDB
{

    public class Program
    {

        public static void Main()
        {
            string text =
                "<p>tset if it work <span onresizestart=\"return false\" ondrag=javascript:dragActif(); contenteditable=false style=\"BACKGROUND-COLOR: #c0d4e6\" edth_type=\"var\" edth_var_pob=\"n\" edth_var_pgm=\"RBLZVALO\" edth_var_def=\"B\" edth_var_casse=\"car\" edth_var_lg=\"050\" edth_var_type=\"c\" edth_var_nom=\"Adr_Leg_Lig1\" edth_var_lib=\"Ligne 1 adresse légale\" edth_var_libc=\"Adr_Leg_Lig1\" edth_var_num=\"1\" edth_var_posfich=\"0\">Adr_Leg_Lig1</span> test</p><p>This <font size=2 edth_sizeutia4=\"8\">should</font> Work</p>";

            HtmlDocument html = new HtmlDocument();
            html.LoadHtml(text);

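            // Select every <p> element in the fragment.
            var nodes = html.DocumentNode.SelectNodes("//p");

            // For each paragraph, concatenate the text of every direct child
            // node except <span> elements.
            foreach (
                var line in
                    nodes.Select(node => node.ChildNodes.Where(childNode => childNode.Name != "span"))
                        .Select(
                            textNodes => textNodes.Aggregate(String.Empty, (current, node) => current + node.InnerText))
                )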
            {
                Console.WriteLine(line);
            }
        }
    }
}
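
As one possible way to filter additional elements, the sketch below removes every span at any depth before reading the paragraph text, and decodes entities such as &nbsp; with HtmlEntity.DeEntitize. It is a sketch rather than a tested solution: the class name, the helper name and the shortened sample fragment are hypothetical, and it avoids LINQ and var so that it does not depend on newer language features.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ParagraphTextSketch
{
    // Hypothetical helper, not part of HtmlAgilityPack.
    public static List<string> ExtractParagraphText(string fragment)
    {
        HtmlDocument html = new HtmlDocument();
        html.LoadHtml(fragment);

        // SelectNodes returns null when nothing matches, so guard before iterating.
        HtmlNodeCollection spans = html.DocumentNode.SelectNodes("//span");
        if (spans != null)
        {
            foreach (HtmlNode span in spans)
            {
                // Detach the span and everything inside it from the document.
                span.Remove();
            }
        }

        List<string> lines = new List<string>();
        HtmlNodeCollection paragraphs = html.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (HtmlNode p in paragraphs)
            {
                // InnerText keeps entities such as &nbsp; as written; DeEntitize decodes them.
                lines.Add(HtmlEntity.DeEntitize(p.InnerText).Trim());
            }
        }
        return lines;
    }

    public static void Main()
    {
        // Shortened version of the fragment from the question (an assumption).
        string fragment =
            "<p>tset if it work <span contenteditable=false>Adr_Leg_Lig1</span> test</p>" +
            "<p>This <font size=2>should</font> Work</p>";

        foreach (string line in ExtractParagraphText(fragment))
        {
            Console.WriteLine(line);
        }
    }
}
```

Table cells could be handled the same way by adding a second query such as "//td", taking care not to count a P nested inside a TD twice.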

edited Jan 05 '16 at 12:01

answered Jan 05 '16 at 11:36

Alex


You were writing your answer, so I guess you did not see my update: I cannot use external libraries for the moment. I have seen a lot of posts saying to use this, but if I can do it natively it would be better, because I have to explain it to the client's developers afterwards. – Slayner Jan 05 '16 at 11:38
All the spans and their content must disappear from the result string. – Slayner Jan 05 '16 at 11:39
You can download the source code for HTMLAgilityPack and include it in your project. Parsing HTML is very difficult as there can be many edge cases, and as your client I would prefer you to use a well-tested external library rather than re-inventing the wheel. Looking at the source code, you will only have to include a handful of classes to get it to work. – Alex Jan 05 '16 at 11:53
Okay, I will try it at the same time as the other answer and will post feedback. – Slayner Jan 05 '16 at 13:15
Does the Agility Pack require a root node like XML, or can there be no root node? Because if it needs one, I will have to create it before using the Agility Pack. – Slayner Jan 11 '16 at 09:22
Agility Pack doesn't require a root node; however, it may struggle if there are (too many) errors in the HTML syntax, for example unclosed elements. – Alex Jan 11 '16 at 09:49
Okay, because I'm implementing it right now and will soon start testing it with our project. – Slayner Jan 11 '16 at 09:52
Does the Agility Pack have a way to replace all entities like &nbsp so that they don't end up in the final string? Or do I have to remove them manually using a regex after getting the InnerText of the node? – Slayner Jan 11 '16 at 10:22
Yes it can handle `&nbsp`, see http://stackoverflow.com/questions/6665488/htmlagilitypack-and-htmldecode – Alex Jan 11 '16 at 10:28
Thanks, I will look into this, and into filtering the received text further, because sometimes it picks up text from places we don't want. – Slayner Jan 11 '16 at 10:31

score 0 · Answer 3 · answered Jan 05 '16 at 11:21


HTML is not usually valid XML. You need an HTML parser that can load HTML from a string and let you extract content from it.

I do a lot of web scraping and have found that CsQuery does the trick nicely. It converts HTML into an in-memory DOM that can be queried using functions and selectors just like the ones jQuery provides.
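
A minimal sketch of what that could look like, assuming the CsQuery package is referenced. The class name, the helper name and the shortened sample fragment are hypothetical, and the selectors would need adapting to the real rules.

```csharp
using System;
using CsQuery;

public static class CsQuerySketch
{
    // Hypothetical helper: returns the combined text of all p elements,
    // after dropping every span and its content.
    public static string ExtractParagraphText(string fragment)
    {
        // Build an in-memory DOM from the HTML string.
        CQ dom = CQ.Create(fragment);

        // jQuery-style selector and removal.
        dom["span"].Remove();

        // Text() returns the combined text of the matched elements.
        return dom["p"].Text();
    }

    public static void Main()
    {
        // Shortened version of the fragment from the question (an assumption).
        string fragment =
            "<p>tset if it work <span contenteditable=false>Adr_Leg_Lig1</span> test</p>" +
            "<p>This <font size=2>should</font> Work</p>";

        Console.WriteLine(ExtractParagraphText(fragment));
    }
}
```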

answered Jan 05 '16 at 11:21

iCollect.it Ltd


I can't use jQuery or external libraries. This is why I have so much trouble doing this. I will add this info to the post. – Slayner Jan 05 '16 at 11:33
@Slayner: It is not jQuery, just "jQuery-like" syntax. CsQuery is open source C# code, so what's the problem using it? – iCollect.it Ltd Jan 05 '16 at 11:41
Oh okay, my bad. I will look at it to see if it can be applied to my case. – Slayner Jan 05 '16 at 11:43

score 0 · Answer 4 · edited May 23 '17 at 10:28


This thread explains how to strip tags from an HTML string with a regex or with the Agility Pack: How do I remove all HTML tags from a string without knowing which tags are in it?

edited May 23 '17 at 10:28

Community


answered Jan 05 '16 at 11:43

Bidou


I already have a solution with regex, but it can't really be applied to my case: sometimes a tag must be deleted, and sometimes the value of that same tag must be taken into account. This is why I need to be able to work with the tags. – Slayner Jan 05 '16 at 11:44
