I have an SGML file with multiple `<chapter>` and `<section>` elements. A chapter can stand on its own, but a section must be inside a chapter, and a chapter can contain multiple sections. My problem is that after the last section, the chapter must have an end tag, `</chapter>`. Right now it doesn't; it just has the section elements.

I've tried putting the document into an array and counting out the elements but that didn't work.

I tried adding `</chapter>` before the next `<chapter>` element, but that left end tags in the wrong order.

I thought I could treat it like an XML file, find the last child in the file, and paste `</chapter>` after it, but I didn't know how to do that.

I'm sorry, I'm really stumped on this, so I don't have any code to post. I have no idea how to approach it.

I really appreciate all your help.

This is the document sample:

<doc service="xT">
<body numcols="1">

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter> <title>Chapter Title</title>
<section id="Section ID">
<title>Section Title</title>
<para0>
<title>Para0 Title</title>
<para>blah blah</para>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<line>Title</line>
<text>blah blah</text>
</para0>
</section>

<ipbchap>
<tags></tags>
</ipbchap>

</body>
<rear>
<tags></tags>
</rear>
</doc>

Here are the expected results:

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>
</chapter>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter> <title>Chapter Title</title>
<section id="Section ID">
<title>Section Title</title>
<para0>
<title>Para0 Title</title>
<para>blah blah</para>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<line>Title</line>
<text>blah blah</text>
</para0>
</section>
</chapter>

<ipbchap>
<tags></tags>
</ipbchap>

</body>
<rear>
<tags></tags>
</rear>
</doc>

This is the code that creates this file:

'Read all text of the Master Document
'and create a StringBuilder from it.
'All replacements will be done on the
'StringBuilder as it is more efficient
'than using Strings directly
Dim strMasterDoc = File.ReadAllText(existingMasterFilePath)
Dim newMasterFileBuilder As New StringBuilder(strMasterDoc)

'Create a regex with a named capture group.
'The name is 'EntityNumber' and captures just the
'entity digits for use in building the file name
Dim rx = New Regex("&" & Prefix & "_Ch(?<EntityNumber>\d+(?:-\d+)*)[;]")
Dim rxMatches = rx.Matches(strMasterDoc)

For Each match As Match In rxMatches
    Dim entity = match.ToString
    'Build the file name using the captured digits from the entity in the master file
    Dim entityFileName = Prefix & $"_Ch{match.Groups("EntityNumber")}.sgm.bak"
    Dim entityFilePath = Path.Combine(searchDir, entityFileName)
    'Check if the entity file exists and use its contents
    'to replace the entity in the copy of the master file
    'contained in the StringBuilder
    If File.Exists(entityFilePath) Then
        Dim entityFileContents As String = File.ReadAllText(entityFilePath)
        newMasterFileBuilder.Replace(entity, entityFileContents)
    End If
Next


'write the processed contents of the master file to a different file
File.WriteAllText(newMasterFilePath, newMasterFileBuilder.ToString)

The issue is the code doesn't take in the final element because the element is empty .

So an approach I could take is to have these empty sections added to the document then remove them?

mightymax
  • It definitely presents as XML. The problem, as you've already stated, is that it's not valid because it's not consistently closing the `<chapter>` tags. Is this a file that you're attempting to correct (properly close the tags) before sending it on or doing further processing? Do you have access to modify whatever process is originally creating these files? – DBro May 16 '19 at 13:10
  • Yes, this file is created using other code. I'll paste it so you can see. – mightymax May 16 '19 at 13:22
  • The file that's represented by `existingMasterFilePath` >> is this the malformed SGML file that you receive from another source? Where does this file come from? – DBro May 16 '19 at 13:31
  • Yes, it is the malformed SGML file. – mightymax May 16 '19 at 13:33
  • My suggestion would be to see this link first: [Dealing with bad XML](https://stackoverflow.com/a/44765546/11292880) Option 1 should be to demand the provider of the document provide well-formed XML. If that's a dead-end, then there are tolerant markup parsers available that can help. Like HTML Tidy, or HTML Agility Pack. The details at that link also reference Microsoft.Language.Xml.XMLParser, which is evidently "error-tolerant", and may be useful. This may also be useful >> [https://github.com/MindTouch/SGMLReader](https://github.com/MindTouch/SGMLReader) – DBro May 16 '19 at 13:46
  • Thank you I'll look into it. – mightymax May 16 '19 at 13:48

1 Answer

Your markup parses just fine if you use an SGML parser. You just need to tell the parser which tags can be omitted. Looking at your markup, the end tag for chapter and both the start and end tags for section are sometimes omitted and should be inferred. That is reflected in the DOCTYPE I added to your input text:

<!DOCTYPE doc [
    <!ELEMENT doc - - (body,rear)>
    <!ELEMENT body - - (chapter+,ipbchap)>
    <!ELEMENT chapter - O (title?,section+)>
    <!ELEMENT section O O (title?,para0*)>
    <!ELEMENT para0 - - (title?,(line|text|para)*)>
    <!ELEMENT para - - (#PCDATA)>
    <!ELEMENT ipbchap - - (tags?)>
    <!ELEMENT tags - - ANY>
    <!ELEMENT title - - (#PCDATA)>
    <!ELEMENT text - - (#PCDATA)>
    <!ELEMENT line - - (#PCDATA)>
    <!ELEMENT rear - - (tags?)>
    <!ATTLIST doc id CDATA #IMPLIED service CDATA #IMPLIED>
    <!ATTLIST body id CDATA #IMPLIED numcols NUMBER #IMPLIED>
    <!ATTLIST para0 id CDATA #IMPLIED verstatus CDATA #IMPLIED>
    <!ATTLIST (chapter|section|para|ipbchap|tags) id CDATA #IMPLIED>
]>
<doc service="xT">
<body numcols="1">

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter> <title>Chapter Title</title>
<section id="Section ID">
<title>Section Title</title>
<para0>
<title>Para0 Title</title>
<para>blah blah</para>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<line>Title</line>
<text>blah blah</text>
</para0>
</section>

<ipbchap>
<tags></tags>
</ipbchap>

</body>
<rear>
<tags></tags>
</rear>
</doc>

If you run the above input through the osgmlnorm program (part of SP/OpenSP, see below), then it produces the following output:

<DOC SERVICE="xT">
<BODY NUMCOLS="1">
<CHAPTER ID="chap1">
<SECTION>
<PARA0>
<TITLE></TITLE>
</PARA0>
</SECTION>
</CHAPTER>
<CHAPTER ID="chap2">
<TITLE>THEORY</TITLE>
<SECTION ID="Thoery">
<TITLE>theory Section</TITLE>
<PARA0 VERSTATUS="ver">
<TITLE>Theory Para 0 </TITLE>
<TEXT>blah blah</TEXT>
</PARA0>
</SECTION>
<SECTION ID="Next section">
<TITLE>title</TITLE>
<PARA0>
<TITLE>Title</TITLE>
<TEXT>blah blah</TEXT>
</PARA0>
</SECTION>
<SECTION ID="More sections">
<TITLE>title</TITLE>
<PARA0>
<TITLE>Title</TITLE>
<TEXT>blah blah</TEXT>
</PARA0>
</SECTION>
<SECTION ID="section">
<TITLE>title</TITLE>
<PARA0>
<TITLE>Title</TITLE>
<TEXT>blah blah</TEXT>
</PARA0>
</SECTION>
</CHAPTER>
<CHAPTER ID="chap1">
<SECTION>
<PARA0>
<TITLE></TITLE>
</PARA0>
</SECTION>
</CHAPTER>
<CHAPTER ID="chap1">
<SECTION>
<PARA0>
<TITLE></TITLE>
</PARA0>
</SECTION>
</CHAPTER>
<CHAPTER>
<TITLE>Chapter Title</TITLE>
<SECTION ID="Section ID">
<TITLE>Section Title</TITLE>
<PARA0>
<TITLE>Para0 Title</TITLE>
<PARA>blah blah</PARA>
</PARA0>
</SECTION>
<SECTION ID="Next section">
<TITLE>title</TITLE>
<PARA0>
<LINE>Title</LINE>
<TEXT>blah blah</TEXT>
</PARA0>
</SECTION>
</CHAPTER>
<IPBCHAP>
<TAGS></TAGS>
</IPBCHAP>
</BODY>
<REAR>
<TAGS></TAGS>
</REAR>
</DOC>

I hope this is what you had in mind. osgmlnorm (like related programs such as osx, which produces XML from SGML) is part of James Clark's SP SGML processing package. More up-to-date versions (OpenSP/OpenJade) are available for Linux and Mac OS, but since you use Visual Basic, I'm pointing you to James' original SP site, http://www.jclark.com/sp/, where you can download old but still working builds for Windows.
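If it helps, here is a minimal sketch of how the normalization step might be called from your VB.NET code after the master file has been written. It is not a definitive implementation. The module name `SgmlNormalizer`, the function names, and the `exeName` parameter are all hypothetical, and the sketch assumes the normalizer executable is on your PATH. I believe the original SP builds name the program `sgmlnorm` (without the `o` prefix that OpenSP uses), so adjust `exeName` to match whatever you installed.

```vb
Imports System.Diagnostics
Imports System.IO

' Hypothetical helper module; the names are illustrative, not part of SP/OpenSP.
Module SgmlNormalizer

    ' Returns the document text with the DOCTYPE declaration placed in front of it,
    ' so the parser knows which tags may be omitted.
    Function PrependDoctype(doctype As String, documentText As String) As String
        Return doctype.TrimEnd() & Environment.NewLine & documentText.TrimStart()
    End Function

    ' Runs the normalizer on inputPath and writes its standard output to outputPath.
    ' Assumption: exeName ("osgmlnorm" or "sgmlnorm") can be found on the PATH.
    ' Returns the exit code of the process.
    Function RunNormalizer(inputPath As String,
                           outputPath As String,
                           Optional exeName As String = "osgmlnorm") As Integer
        ' Quote the path in case it contains spaces.
        Dim psi As New ProcessStartInfo(exeName, """" & inputPath & """") With {
            .UseShellExecute = False,
            .RedirectStandardOutput = True,
            .CreateNoWindow = True
        }
        Using proc = Process.Start(psi)
            ' Read all output before waiting, so a full pipe cannot block the child.
            Dim normalized = proc.StandardOutput.ReadToEnd()
            proc.WaitForExit()
            File.WriteAllText(outputPath, normalized)
            Return proc.ExitCode
        End Using
    End Function

End Module
```

You would call `PrependDoctype` with the DOCTYPE shown above and the contents of your generated master file, save the result to a temporary file, and then pass that file to `RunNormalizer`. Parser error messages go to standard error, which this sketch does not capture.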

imhotap