5

I have HTML content that looks like this:

<body>Hello world</div><div>New day</div></body>

I would like to parse this HTML snippet and add an opening div tag before "Hello". What approach could I follow? I tried HtmlCleaner, but it didn't help. Basically, I want to find closing div tags that have no matching opening div tag and add the missing opening tags.

Ripon Al Wasim
Thunderhashy
  • If you use Java, try using `Jsoup`. Something like `Jsoup.clean("Hello world
    New day
    ", Whitelist.relaxed());`
    – RP- Mar 07 '14 at 20:50
  • Just tried that and it gives me "Hello world
    New day
    "
    – Thunderhashy Mar 07 '14 at 21:01
  • This is an interesting question. Most parsers won't try to insert the opening tag because there's no way to tell where it ought to go; technically, it could go anywhere from the start of `body` to the start of the closing `div`. Is there some pattern to how these start tags are going amiss that can be used to predict where they should be inserted? – Jordan Gray Mar 20 '14 at 17:03

7 Answers

2

If you use Java, try Jsoup. Something like:

Jsoup.clean("<body><div>Hello world</div><div>New day</div></body>", Whitelist.relaxed());

This will give you the proper output string.

UPDATE

You can use Jsoup.parse(html), which returns a Document. Calling toString() on it gives you the repaired HTML, including the html and body tags. For your HTML it gives the following output:

   <html>
    <head></head>
    <body>
      <div>
        Hello world
      </div>
      <div>
        New day
      </div>
    </body>
   </html>

As you said, most parsers will fix missing end tags but not missing start tags, because they cannot decide where the start tag belongs. The only position they could be sure of is immediately before the unmatched end tag, and adding the start tag there would be useless.

You may need to implement your own logic for that, as in Trevor Hutto's suggestion below (a stack-based approach), but it will have its own complications depending on your requirements.

Nikitesh
RP-
1

You could use a stack.

Push on open tags, then when you hit a close tag, pop and compare the popped tag to the one you just ran into.

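If the popped tag does not match and the closing tag is a div, you have found a closing div without an opening div, and you can handle it, for example by inserting the missing tag.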
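A minimal Java sketch of this idea, under several assumptions: tags are found with a simple regular expression (so comments, scripts, and void elements such as br are not handled), and the class and method names are hypothetical. It also peeks at the stack instead of popping unconditionally, so a stray closing tag does not discard a valid open tag. It only reports the unmatched closing tags; what to do with them is left open.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class UnmatchedCloseTags {
    // Matches simple tags such as <div>, </div> or <div class="x">.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    /** Returns the names of closing tags that do not match the innermost open tag. */
    static List<String> find(String html) {
        Deque<String> open = new ArrayDeque<>();
        List<String> unmatched = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            if (m.group().endsWith("/>")) {
                continue; // self-closing tag: nothing to match
            }
            String name = m.group(2).toLowerCase();
            if (m.group(1).isEmpty()) {
                open.push(name);                 // opening tag
            } else if (name.equals(open.peek())) {
                open.pop();                      // closing tag matches innermost open tag
            } else {
                unmatched.add(name);             // closing tag with no matching opening tag
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        System.out.println(find("<body>Hello world</div><div>New day</div></body>"));
    }
}
```

For the snippet from the question, the only tag reported is the first closing div.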

Trevor Hutto
1

John Resig's HTML Parser does a pretty good job of this. It's a little old, but it has still worked for the large majority of my use cases.

Edit: Actually, it seems to only fix missing closing tags, not opening tags...though some tweaks might be able to have it do the latter.

Collin Henderson
  • Exactly, that's what I am looking for. All the parsers only fix missing closing tags, but none add missing start tags... Is there really no parser that does what I need? – Thunderhashy Mar 07 '14 at 21:07
0

You can use the same technique that's used in parenthesis balancing, except instead of returning True/False, you would fix the tag instead. I did this for a work project once:

Recursive method for parentheses balancing [python]

What Trevor is describing is the same thing I'm describing (used in parenthesis balancing).
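As a rough illustration of "fixing instead of returning True/False", here is a Java sketch. It is not the code from the work project mentioned above; it uses an explicit stack rather than recursion, and the class and method names are hypothetical. It assumes that a missing opening tag belongs right after the previous tag, which matches the result asked for in the question but is only a heuristic, and it finds tags with a simple regular expression, so comments, scripts, and void elements are not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class OpenTagRepair {
    // Matches simple tags such as <div>, </div> or <div class="x">.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    /** Inserts an opening tag before the text that precedes each unmatched closing tag. */
    static String addMissingOpenTags(String html) {
        Deque<String> open = new ArrayDeque<>();
        StringBuilder out = new StringBuilder();
        int lastTagEnd = 0; // index in html just past the previous tag
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = m.group().endsWith("/>");
            if (closing && !name.equals(open.peek())) {
                // Unmatched closing tag: assume its opening tag
                // belongs right after the previous tag.
                out.append('<').append(name).append('>');
            } else if (closing) {
                open.pop();      // matched pair
            } else if (!selfClosing) {
                open.push(name); // opening tag
            }
            // Copy the text since the previous tag, plus this tag.
            out.append(html, lastTagEnd, m.end());
            lastTagEnd = m.end();
        }
        out.append(html.substring(lastTagEnd)); // trailing text after the last tag
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
                addMissingOpenTags("<body>Hello world</div><div>New day</div></body>"));
    }
}
```

For the snippet from the question this prints `<body><div>Hello world</div><div>New day</div></body>`. Input that is already balanced is returned unchanged.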

Community
antimatter
0

I have created a Javascript/jQuery solution to add missing starting tags:

Demo Fiddle/Watch Fullscreen

Add any HTML to the body with missing tags like:

hello</h3>
<p>hai</p>
Welcome to fiddle</span>
</div>

Javascript/jQuery

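var content;
var result = "";
var previousTag = "";

function exeq() {
    // Keep only the markup between <body> and </body>.
    var a = content.lastIndexOf('<body>');
    var z = content.lastIndexOf('</body>');
    content = content.substring(a + 6, z);

    // Walk through the markup one tag at a time.
    var endAngle = content.indexOf('>');
    while (endAngle != -1) {
        var startAngle = content.indexOf('<');
        var ele = content.substring(startAngle, endAngle + 1);
        // A closing tag starts with "</".
        if (ele.indexOf("</") == 0) {
            var openTag = ele.replace("/", "");
            // If the previous tag is not the matching opening tag,
            // insert one before the text that precedes this closing tag.
            if (previousTag != openTag)
                result = result + openTag;
        }
        result = result + content.substring(0, endAngle + 1);
        content = content.substring(endAngle + 1);
        previousTag = ele;
        endAngle = content.indexOf('>');
    }
    // Keep any text left over after the last tag.
    result = result + content;

    /*Below part only to append result to body*/
    $('body').append('<h4>Result</h4><textarea>' + result + '</textarea>');
    /******************************************/
}

// Fetch the source of the current page and process it.
$.get(window.location.href, function(data) {
    // If the response was parsed into an object instead of text, reload the page.
    if (typeof data == "object") {
        window.location = window.location.href;
        return;
    }
    content = data;
    exeq();
});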
Zword
0
<body>
<div>Hello world</div>
<div>New day</div>
</body>

You can add an opening div before "Hello world", or you can remove the closing div after "Hello world". Hello world New day

Jimi Gautam
-2

You don't need HtmlCleaner or any other tool. Working with HTML is very simple: just remember that every tag <'something'> is closed with a </'something'>, or use a simple <'something'/> as a shorthand!