I have the following string:

<table:table-cell table:style-name="Table2.A1" office:value-type="string">
   <text:p text:style-name="P32">
      <text:span text:style-name="T1">test description</text:span>
      <text:span text:style-name="T2"> </text:span>
      <text:span text:style-name="T1">17/07/2013</text:span>
      <text:span text:style-name="T2"> </text:span>
      <text:span text:style-name="T1">test</text:span>
      <text:span text:style-name="T2"> </text:span>
      <text:span text:style-name="T1">test</text:span>
      <text:span text:style-name="T3"></text:span>
      <text:span text:style-name="T1">test</text:span>
      <text:span text:style-name="T3">test <!-- end tag is missing -->
  </text:p>
</table:table-cell>

Is there a way to find the unclosed tag and insert it?

Expected output:

<table:table-cell table:style-name="Table2.A1" office:value-type="string">
   <text:p text:style-name="P32">
      <text:span text:style-name="T1">test description</text:span>
      <text:span text:style-name="T2"> </text:span>
      <text:span text:style-name="T1">17/07/2013</text:span>
      <text:span text:style-name="T2"> </text:span>
      <text:span text:style-name="T1">test</text:span>
      <text:span text:style-name="T2"> </text:span>
      <text:span text:style-name="T1">test</text:span>
      <text:span text:style-name="T3"></text:span>
      <text:span text:style-name="T1">test</text:span>
      <text:span text:style-name="T3">test</text:span>
  </text:p>
</table:table-cell>

Thanks in advance

Daniel
  • how can you find something that is *missing*? – HennyH Jul 17 '13 at 08:14
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags :) – bobah Jul 17 '13 at 08:16
  • @HennyH, it's possible – RaceBase Jul 17 '13 at 08:18
  • @Reddy: The problem is that the expected output is not properly stated in the question. As HennyH said: You cannot find something that's missing. Nevertheless, the correct answer to the question is in my opinion: No, since he's asking whether it can be done with Regex. – theV0ID Jul 17 '13 at 08:20
  • @theV0ID, he didn't mention Regex anywhere. OK, it seems he used the Regex tag but didn't mention it in the question. Leaving regex aside, this is practically possible as a software engineering task. With Regex alone, yes, it's not possible – RaceBase Jul 17 '13 at 08:23
  • @Reddy: It was stated in the title at the moment I wrote my comment, as you can see in the [revisions](http://stackoverflow.com/posts/17694545/revisions). – theV0ID Jul 17 '13 at 11:38

2 Answers

Yes. It's quite possible.

This is a basic data structures problem. Use a stack to keep track of the open tags and check whether each one is closed properly:

  1. Push the tag name onto the stack as soon as you read an opening tag.
  2. Pop the stack as soon as you read a closing tag, and compare the popped name with the closing tag's name. If they differ, the popped tag was never closed.

This is only the basic idea, but it is the way to your solution.
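The stack idea can be sketched in Java roughly as follows. This is a minimal illustration, not a full XML parser: the names `TagCloser`, `closeUnclosedTags` and `TOKEN` are made up for this example, tags are found with a regular expression, and it assumes attribute values never contain `>` and that the input has no CDATA sections or processing instructions. A closing tag with no matching opening tag is simply dropped.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagCloser {
    // Matches a comment, or a start/end tag.
    // Group 1: "/" for end tags, group 2: tag name, group 3: "/" for self-closing tags.
    private static final Pattern TOKEN = Pattern.compile(
            "<!--.*?-->|<(/?)([A-Za-z_][\\w:.-]*)(?:\\s[^<>]*?)?(/?)>",
            Pattern.DOTALL);

    static String closeUnclosedTags(String xml) {
        Deque<String> open = new ArrayDeque<>();   // names of tags still open
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(xml);
        int last = 0;
        while (m.find()) {
            out.append(xml, last, m.start());      // copy text before this token
            last = m.end();
            String name = m.group(2);
            if (name == null || !m.group(3).isEmpty()) {
                // Comment or self-closing tag: copy through, stack unchanged.
                out.append(m.group());
            } else if (m.group(1).isEmpty()) {
                // Opening tag: remember it.
                open.push(name);
                out.append(m.group());
            } else if (open.contains(name)) {
                // Closing tag: first close everything opened after its start tag.
                while (!open.peek().equals(name)) {
                    out.append("</").append(open.pop()).append('>');
                }
                open.pop();
                out.append(m.group());
            }
            // Otherwise: stray closing tag with no matching start tag, dropped.
        }
        out.append(xml, last, xml.length());
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(closeUnclosedTags("<p><span>test <!-- missing -->\n</p>"));
    }
}
```

Note that the stack only tells you the tag is still open when its parent closes, so the missing `</text:span>` is inserted immediately before `</text:p>` (after the comment and the whitespace), not directly after the word "test" as in the expected output.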

RaceBase
  • I used `XMLInputFactory` and `XMLEventReader` to load my XML. When I call `eventReader.nextEvent()` I get an event, which can be a StartElement or an EndElement. If a tag is not closed I get an exception, but at that point I don't know what to do. In the end I need a string representing a valid XML file. – Daniel Jul 17 '13 at 10:21
  • As I said, you can use a stack to do this. I can't suggest other options because I don't know of any. – RaceBase Jul 17 '13 at 11:42
  • I used this example http://stackoverflow.com/questions/13083756/how-to-find-unclosed-tags-in-xml-with-java but when I get the exception I don't know how to add the tag – Daniel Jul 17 '13 at 12:06
  • That example also uses a stack for this. – RaceBase Jul 17 '13 at 15:41

A very simple and workable solution is to use one of the available lenient "HTML" SAX readers:

  1. TagSoup, or
  2. HTML tidy

I believe both provide `XMLReader` implementations (I'm certain TagSoup does) that are very forgiving about the kind of "brutal" "HTML" they accept, and they always produce well-formed XML (XHTML). For instance, this is how you could use DOM4J together with TagSoup to "correct" the invalid input:

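    import org.dom4j.Document;
    import org.dom4j.io.SAXReader;
    import org.dom4j.io.XMLWriter;

    // Have dom4j parse through TagSoup's lenient parser instead of a strict XML parser
    SAXReader reader = new SAXReader(
            org.ccil.cowan.tagsoup.Parser.class.getName());
    Document doc = reader.read(...);
    XMLWriter writer = new XMLWriter(System.out);
    writer.write(doc);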

Given your input, it produces:

<table:table-cell xmlns:table="urn:x-prefix:table" xmlns:html="http://www.w3.org/1999/xhtml" xmlns:office="urn:x-prefix:office" table:style-name="Table2.A1" office:value-type="string">
   <text:p xmlns:text="urn:x-prefix:text" text:style-name="P32">
      <text:span text:style-name="T1">test description</text:span>
      <text:span text:style-name="T2"> </text:span>
      <text:span text:style-name="T1">17/07/2013</text:span>
      <text:span text:style-name="T2"> </text:span>
      <text:span text:style-name="T1">test</text:span>
      <text:span text:style-name="T2"> </text:span>
      <text:span text:style-name="T1">test</text:span>
      <text:span text:style-name="T3"></text:span>
      <text:span text:style-name="T1">test</text:span>
      <text:span text:style-name="T3">test <!-- end tag is missing -->
  </text:span></text:p>
</table:table-cell>
forty-two