Regular expression for converting SGML to XML

Question

I am converting SGML content to XML content with the help of this link. Using the regular expression sgmlString.replaceAll("<(([^<>]+?)>)([^<>]+?)(?=<(?!\\1))", "<$1$3</$2>"); I am almost at the expected result, but for the following file, where several sibling tags with the same name appear without closing tags, only the last of them gets closed.

Input:

<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
    <ACCEPTANCE-DATETIME>20170817060417
    <ACCESSION-NUMBER>0001104659-17-052330
    <TYPE>8-K
    <PUBLIC-DOCUMENT-COUNT>4
    <PERIOD>20170816
    <ITEMS>7.01
    <ITEMS>8.16
    <FILING-DATE>20170817
    <DATE-OF-FILING-DATE-CHANGE>20170817
    <FILER>
        bye bye see you!
    </FILER>
</SEC-HEADER>

Output (note that ITEMS is closed only once and FILER is closed twice, which is not expected):

  <SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
     <ACCEPTANCE-DATETIME>20170817060417</ACCEPTANCE-DATETIME>
     <ACCESSION-NUMBER>0001104659-17-052330</ACCESSION-NUMBER>
     <TYPE>8-K</TYPE>
     <PUBLIC-DOCUMENT-COUNT>4</PUBLIC-DOCUMENT-COUNT>
     <PERIOD>20170816</PERIOD>
     <ITEMS>7.01<ITEMS>8.16</ITEMS>
     <FILING-DATE>20170817</FILING-DATE>
     <DATE-OF-FILING-DATE-CHANGE>20170817</DATE-OF-FILING-DATE-CHANGE>
     <FILER>bye bye see you!</FILER></FILER>
</SEC-HEADER>

Expected:

  <SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
         <ACCEPTANCE-DATETIME>20170817060417</ACCEPTANCE-DATETIME>
         <ACCESSION-NUMBER>0001104659-17-052330</ACCESSION-NUMBER>
         <TYPE>8-K</TYPE>
         <PUBLIC-DOCUMENT-COUNT>4</PUBLIC-DOCUMENT-COUNT>
         <PERIOD>20170816</PERIOD>
         <ITEMS>7.01</ITEMS>
         <ITEMS>8.16</ITEMS>
         <FILING-DATE>20170817</FILING-DATE>
         <DATE-OF-FILING-DATE-CHANGE>20170817</DATE-OF-FILING-DATE-CHANGE>
         <FILER>bye bye see you!</FILER>
    </SEC-HEADER>

I would appreciate your suggestions/guidance on the following questions:

Is it a good approach to use a regular expression to add the closing tags and produce XML? I have read that regular expressions are slow.
I have quite heavy files to process (up to 18000 lines/tags); is there a better way to achieve this?
How can I get the expected result by changing the regular expression? (I am really weak in EL.)

May I suggest that you look into using XSLT, which from what I understand about it might be a great fit for this XML transformation problem? I think using regex here is inviting problems, especially with nested tags. — Tim Biegeleisen, Aug 31 '17 at 05:43
I have no idea of XSLT and how to use it. Could you please provide some link of a guide/working-example or something to refer for it. Thanks for a quick response. — Shailesh Saxena, Aug 31 '17 at 05:47
@TimBiegeleisen Wouldn't an XSLT fail for anything not already well-formed? — Yunnosch, Aug 31 '17 at 05:49
I am inexperienced with sgml, but shouldn't this `(?!\\1)` be `(?!\\\1)` in order to have a) an escaped `\\` b) a reference to first match? — Yunnosch, Aug 31 '17 at 05:51
@Yunnosch I am also inexperienced with sgml and regex, I tried your suggestion. It solved the ITEMS tag related problem, but still there are two closings of FILER in the output. — Shailesh Saxena, Aug 31 '17 at 05:56
In my experiments, I always have troubles with the sec-header being processed unwantedly. Which part of your regex is supposed to prevent that? — Yunnosch, Aug 31 '17 at 06:54
I have the impression that `(?=<(?!\\1))` matches too often, because `(?!\\1)` always matches by never matching `\\1` (or `\\\1`) . — Yunnosch, Aug 31 '17 at 06:56
I don't have a special regex to take care of SEC-HEADER; I just remove it manually once the regex has done its job. Thanks for your helpful input. — Shailesh Saxena, Aug 31 '17 at 06:59
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/153322/discussion-between-yunnosch-and-shailesh-saxena). — Yunnosch, Aug 31 '17 at 07:04
There is no way that you can convert arbitrary SGML to XML using a regular expression. You might find a solution that works on some subset of SGML, but there will always be valid SGML instances that it can't handle. (If that sounds over-assertive, this is something that can be proved rigorously using computer science theory.) — Michael Kay, Aug 31 '17 at 09:00

score 2 · Answer 1 · answered Sep 02 '17 at 12:58

While it may work for the SGML at hand, in general using regexp match/replace is a terrible approach for converting SGML to XML, because SGML has tag omission/tag inference, attribute name and value omission (like in HTML), and other short forms and features not in the XML profile of SGML.

But there is a dedicated SGML-to-XML conversion program for this, osx, which I can fully recommend. Its source is available from http://openjade.sourceforge.net/. If you're on Debian/Ubuntu, you can install it via sudo apt-get install opensp, and if you're on Mac OS (using MacPorts, which you must install first) via sudo port install opensp (I don't know the Homebrew equivalent, though).

score 0 · Answer 2 · answered Aug 31 '17 at 07:45

I have a solution in Perl. It builds the special treatment of <SEC-HEADER> into the regex itself.

Perl code:

use strict;
use warnings;

my $Input ='';
while(<>)
{
    $Input.=$_;
}

$Input =~ s/<((?!SEC-HEADER)([^\/<>]+?)>)([^<>]+?)(\s*?)(?=<[^\/])/<$1$3<\/$2>$4/g;
print $Input;

In order to translate it to your tool (which I cannot test on and have to guess about its syntax), I propose trying:

sgmlString.replaceAll("<((?!SEC-HEADER)([^/<>]+?)>)([^<>]+?)(\\s*?)(?=<[^/])", "<$1$3</$2>$4");

Sorry, you may have to polish a few tool-specific mistakes yourself, maybe by trial and error. (If your tool is Java, as the replaceAll call suggests, / needs no escaping and \s has to be written as \\s inside the string literal.)
With my Perl version I got the following output, which I hope is close enough; it just does not eat the whitespace inside <FILER>.

Output:

<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
    <ACCEPTANCE-DATETIME>20170817060417</ACCEPTANCE-DATETIME>
    <ACCESSION-NUMBER>0001104659-17-052330</ACCESSION-NUMBER>
    <TYPE>8-K</TYPE>
    <PUBLIC-DOCUMENT-COUNT>4</PUBLIC-DOCUMENT-COUNT>
    <PERIOD>20170816</PERIOD>
    <ITEMS>7.01</ITEMS>
    <ITEMS>8.16</ITEMS>
    <FILING-DATE>20170817</FILING-DATE>
    <DATE-OF-FILING-DATE-CHANGE>20170817</DATE-OF-FILING-DATE-CHANGE>
    <FILER>
        bye bye see you!
    </FILER>
</SEC-HEADER>

Details:

instead of the negative lookahead on \1, look ahead for the next opening tag: a < followed by a non-/ character
do not allow / inside the tag name, so a closing tag is never treated as an opener
ignore the special tag name SEC-HEADER, as you implicitly allowed
capture the trailing whitespace and emit it after the new closing tag, to get indentation and newlines right
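
The points above can be sketched in Java, which is what the replaceAll call in the question looks like. This is only a minimal sketch under assumptions: the class name SgmlTagCloser and the method closeOpenTags are made up for illustration, the sample input is a shortened version of the one in the question, and SEC-HEADER is assumed to be the only tag to skip. The pattern is compiled once and reused, since String.replaceAll recompiles it on every call.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class SgmlTagCloser {
    // Same pattern as the Perl substitution, written as a Java string literal:
    // group 1 = "NAME>", group 2 = NAME, group 3 = text, group 4 = trailing whitespace.
    private static final Pattern UNCLOSED_TAG = Pattern.compile(
        "<((?!SEC-HEADER)([^/<>]+?)>)([^<>]+?)(\\s*?)(?=<[^/])");

    // Adds a closing tag to every opening tag whose text is directly
    // followed by another opening tag.
    static String closeOpenTags(String sgml) {
        return UNCLOSED_TAG.matcher(sgml).replaceAll("<$1$3</$2>$4");
    }

    public static void main(String[] args) {
        String input = String.join("\n",
            "<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817",
            "    <ITEMS>7.01",
            "    <ITEMS>8.16",
            "    <FILER>",
            "        bye bye see you!",
            "    </FILER>",
            "</SEC-HEADER>");
        // Both ITEMS tags get closed; FILER and SEC-HEADER are left as they are.
        System.out.println(closeOpenTags(input));
    }
}
```

FILER is not touched because its text is followed by </, which the lookahead rejects.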

If you do want the whitespace eaten, here is a (perl) replace to do that:

$Input =~ s/<(?!\/)([^<>]+)>\s*([^<>]+[^\s<>])\s*<\/\1>/<$1>$2<\/$1>/g;

Guessed version for your tool
(again, sorry for little mistakes, please polish them yourself):

sgmlString.replaceAll("<(?!/)([^<>]+)>\\s*([^<>]+[^\\s<>])\\s*</\\1>", "<$1>$2</$1>");

Output (applied after first code):

<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
    <ACCEPTANCE-DATETIME>20170817060417</ACCEPTANCE-DATETIME>
    <ACCESSION-NUMBER>0001104659-17-052330</ACCESSION-NUMBER>
    <TYPE>8-K</TYPE>
    <PUBLIC-DOCUMENT-COUNT>4</PUBLIC-DOCUMENT-COUNT>
    <PERIOD>20170816</PERIOD>
    <ITEMS>7.01</ITEMS>
    <ITEMS>8.16</ITEMS>
    <FILING-DATE>20170817</FILING-DATE>
    <DATE-OF-FILING-DATE-CHANGE>20170817</DATE-OF-FILING-DATE-CHANGE>
    <FILER>bye bye see you!</FILER>
</SEC-HEADER>
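
For completeness, the whitespace-eating second pass might look like this in Java. Again, this is a sketch under assumptions rather than a tested drop-in: SgmlWhitespaceTrimmer and trimElementText are hypothetical names, and the input is expected to be the result of the first replace, where every tag already has its closing tag.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class SgmlWhitespaceTrimmer {
    // Group 1 = tag name, group 2 = text without surrounding whitespace;
    // \\1 requires the closing tag to carry the same name as the opener.
    private static final Pattern PADDED_ELEMENT = Pattern.compile(
        "<(?!/)([^<>]+)>\\s*([^<>]+[^\\s<>])\\s*</\\1>");

    // Removes whitespace between a tag pair and the text it encloses.
    static String trimElementText(String xml) {
        return PADDED_ELEMENT.matcher(xml).replaceAll("<$1>$2</$1>");
    }

    public static void main(String[] args) {
        String input = "<FILER>\n        bye bye see you!\n    </FILER>";
        // prints <FILER>bye bye see you!</FILER>
        System.out.println(trimElementText(input));
    }
}
```

Elements that contain other elements, such as SEC-HEADER, are not matched, because the text group cannot span a < character.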
