I am converting SGML content to XML with the help of this link.
Using the regular expression
sgmlString.replaceAll("<(([^<>]+?)>)([^<>]+?)(?=<(?!\\1))", "<$1$3</$2>");
I am close to the expected result. However, for the following file, when several consecutive tags share the same name and have no closing tag, only the last of them gets closed.
Input:
<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
<ACCEPTANCE-DATETIME>20170817060417
<ACCESSION-NUMBER>0001104659-17-052330
<TYPE>8-K
<PUBLIC-DOCUMENT-COUNT>4
<PERIOD>20170816
<ITEMS>7.01
<ITEMS>8.16
<FILING-DATE>20170817
<DATE-OF-FILING-DATE-CHANGE>20170817
<FILER>
bye bye see you!
</FILER>
</SEC-HEADER>
Output (note that ITEMS is closed only once and FILER is closed twice, neither of which is expected):
<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
<ACCEPTANCE-DATETIME>20170817060417</ACCEPTANCE-DATETIME>
<ACCESSION-NUMBER>0001104659-17-052330</ACCESSION-NUMBER>
<TYPE>8-K</TYPE>
<PUBLIC-DOCUMENT-COUNT>4</PUBLIC-DOCUMENT-COUNT>
<PERIOD>20170816</PERIOD>
<ITEMS>7.01<ITEMS>8.16</ITEMS>
<FILING-DATE>20170817</FILING-DATE>
<DATE-OF-FILING-DATE-CHANGE>20170817</DATE-OF-FILING-DATE-CHANGE>
<FILER>bye bye see you!</FILER></FILER>
</SEC-HEADER>
Expected:
<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
<ACCEPTANCE-DATETIME>20170817060417</ACCEPTANCE-DATETIME>
<ACCESSION-NUMBER>0001104659-17-052330</ACCESSION-NUMBER>
<TYPE>8-K</TYPE>
<PUBLIC-DOCUMENT-COUNT>4</PUBLIC-DOCUMENT-COUNT>
<PERIOD>20170816</PERIOD>
<ITEMS>7.01</ITEMS>
<ITEMS>8.16</ITEMS>
<FILING-DATE>20170817</FILING-DATE>
<DATE-OF-FILING-DATE-CHANGE>20170817</DATE-OF-FILING-DATE-CHANGE>
<FILER>bye bye see you!</FILER>
</SEC-HEADER>
I would appreciate suggestions or guidance on the following questions:
- Is a regular expression a good approach for adding the closing tags needed to produce XML? I have read that regular expressions are slow.
- My files are fairly large (up to 18,000 lines/tags). Is there a better way to achieve this?
- How can I change the regular expression to get the expected result? (I am really weak in EL.)
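
For reference, one direction that might be worth considering is a line-based pass instead of a single replaceAll. The following is only a minimal sketch under assumptions that are not confirmed by the sample above: every unclosed tag sits on one line together with its value, and a tag name that has an explicit closing tag somewhere in the file (such as FILER or SEC-HEADER) is never also used in the unclosed form. The class name SgmlTagCloser and the method closeOpenTags are hypothetical names chosen for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SgmlTagCloser {
    // Matches an explicit closing tag such as </FILER> and captures its name.
    private static final Pattern CLOSING_TAG =
            Pattern.compile("</([^<>/\\s]+)>");

    // Matches a whole line of the form <NAME>value, where value is non-empty
    // and contains no further tags. Group 1 is the name, group 2 the value.
    private static final Pattern OPEN_WITH_VALUE =
            Pattern.compile("^<([^<>/\\s]+)>([^<>]*[^<>\\s])\\s*$");

    public static String closeOpenTags(String sgml) {
        // Pass 1: collect every tag name that is already closed explicitly.
        // Assumption: such names never also occur in the unclosed form.
        Set<String> explicitlyClosed = new HashSet<>();
        Matcher closing = CLOSING_TAG.matcher(sgml);
        while (closing.find()) {
            explicitlyClosed.add(closing.group(1));
        }

        // Pass 2: append </NAME> to each <NAME>value line whose name was
        // not seen in pass 1; copy every other line unchanged.
        StringBuilder out = new StringBuilder(sgml.length() * 2);
        for (String line : sgml.split("\\R")) {
            Matcher open = OPEN_WITH_VALUE.matcher(line);
            if (open.matches() && !explicitlyClosed.contains(open.group(1))) {
                out.append('<').append(open.group(1)).append('>')
                   .append(open.group(2))
                   .append("</").append(open.group(1)).append('>');
            } else {
                out.append(line);
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

It would be called as String xml = SgmlTagCloser.closeOpenTags(sgmlString);. With the sample input, each ITEMS line should get its own closing tag and FILER should not be closed a second time, although the FILER block stays on three lines rather than being joined into one as in the expected output. Each line is visited once, so a file of about 18,000 lines should not be a performance concern with this approach.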