How to remove indentation from an XML string?

Question

I have an XML string and I'm not able to remove the indentation spaces from it. I have already replaced the newlines.

  <person id="13">
      <name>
          <first>John</first>
          <last>Doe</last>
      </name>
      <age>42</age>
      <Married>false</Married>
      <City>Hanga Roa</City>
      <State>Easter Island</State>
      <!-- Need more details. -->
  </person>

How can I remove the XML indentation spaces from the string in Go?

I want the XML as a single-line string like this:

<person id="13"><name><first>John</first><last>Doe</last></name><age>42</age><Married>false</Married><City>Hanga Roa</City><State>Easter Island</State><!-- Need more details. --></person>

How can I do this in Go?

Show us what you have tried so far so that we can help. – Markus W Mahlberg, Aug 23 '20 at 11:50
Thank you, Markus W Mahlberg. I solved this. – imaheshwaran s, Aug 23 '20 at 12:29

score 2 · Answer 1 · answered Aug 23 '20 at 16:06

Some background

Unfortunately, XML is not a regular language, and hence you simply cannot process it reliably using regular expressions, no matter how complex a regexp you are able to come up with.

I would start with this brilliant humorous take on the issue and then read, say, this.

To demonstrate, a simple change to your example which will break your processing could be, for instance, this:

  <person id="13">
      <name>
          <first>John</first>
          <last>Doe</last>
      </name>
      <age>42</age>
      <Married>false</Married>
      <City><![CDATA[Hanga <<Roa>>]]></City>
      <State>Easter Island</State>
      <!-- Need more details. -->
  </person>
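
To make the breakage concrete, here is a minimal sketch of a purely textual approach applied to such a CDATA section. The helper name `stripIndent` and its regexp are assumptions made for illustration (they stand in for any "delete spaces before `<`, then delete line breaks" processing), not code taken from the question.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// stripIndent is a hypothetical text-only helper: it deletes runs of spaces
// in front of every '<' and then deletes line breaks. It knows nothing about
// XML structure, so it cannot tell markup from CDATA contents.
func stripIndent(s string) string {
	s = regexp.MustCompile(` *<`).ReplaceAllString(s, "<")
	return strings.NewReplacer("\r\n", "", "\n", "").Replace(s)
}

func main() {
	in := "<person>\n    <City><![CDATA[Hanga <<Roa>>]]></City>\n</person>"
	// The space inside the CDATA section is part of the element's text,
	// yet it gets removed together with the indentation.
	fmt.Println(stripIndent(in))
}
```

The indentation is gone, but so is the space between "Hanga" and "<<Roa>>", which was part of the element's contents rather than formatting.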

Actually, consider this

<last>Von
Neumann</last>

Why do you think you are free to drop the line feed from the contents of that element?

Sure, you'll say one cannot sensibly have a newline in their family name.
OK, but what about this?

<poem author="Chauser">
  <strophe number="1">  The lyf so short,
  the craft so long to lerne.</strophe>
</poem>

You cannot sensibly drop the whitespace between the two parts of that sentence, because the author intended it to be there.

Well, OK, the full story is defined in the section called "White Space Handling" of the XML spec.
A layman's attempt to describe whitespace handling in XML is as follows:

The XML spec itself does not assign any special meaning to whitespace: the decision on what whitespace means in a particular place of an XML document is up to the processor of that document.

By extension, the spec does not mandate whether whitespace between any "tags" (those <foo> and </bar> and <quux/> things, appearing at points where XML markup is allowed) is significant or not: only you decide that.
To better understand the reason for this, consider the following document:
```
<p>␣Some text which contains an␣<em>emphasized block</em>
which is followed by a linebreak and more text.</p>
```
This is perfectly valid XML; I have replaced the space characters right after the <p> tag and right before the <em> tag with the Unicode "open box" characters for display purposes.

Note that the whole text ␣Some text which contains an␣ appears between two tags and contains leading and trailing whitespace which is obviously significant: if it were not, the emphasized text (the part marked up with <em>…</em>) would be glued to the preceding text.

The same logic applies to the line break and the text after the </em> tag.
The XML spec hints that it may be convenient to define "insignificant" whitespace to mean any whitespace between a pair of adjacent tags which do not define a single element.

XML also has two features which complicate processing further:

- Character entities (those &amp; and &lt; thingies) allow direct insertion of any Unicode code point: for instance, &#x000a; would insert a line feed character.
- XML supports special "CDATA sections", which your parser ostensibly knows nothing about.

An approach to the solution

Before we try to come up with a solution, we'll define what whitespace we intend to treat as insignificant, and drop.

It looks like, for your kind of document, the definition should be: any character data between any two tags should be deleted unless:

- it contains at least a single non-whitespace character, or
- it completely defines the contents of a single XML element.

With these considerations in mind, we can write code which parses an input XML stream into tokens and writes them into the output XML stream, while applying the following logic to processing the tokens:

If it sees any token other than character data, it encodes it into the output stream.

Additionally, if that token was a start tag, it remembers this fact by setting a flag; otherwise the flag is cleared.
If it sees any character data, it checks whether this character data immediately follows a start element (an opening tag), and if so, the character data block is saved away.

The character data block is also saved when there are already such saved blocks present. This is needed because in XML it's possible to have several adjacent but still distinct character data blocks in a document.
If it sees any token other than character data and detects it has one or more saved character data blocks, it first decides whether to put them into the output stream:
- If the token is an end element (a closing tag), all the character data blocks must be put into the output stream "as is", because they completely define the contents of a single element.
- Otherwise, if at least one of the saved character data blocks contains at least a single non-whitespace character, all blocks are written into the output stream as is.
- Otherwise all the blocks are skipped.

Here is the working code which implements the described approach:

package main

import (
    "encoding/xml"
    "errors"
    "fmt"
    "io"
    "os"
    "strings"
)

const xmlData = `<?xml version="1.0" encoding="utf-8"?>
  <person id="13">
      weird text
      <name>
          <first>John</first>
          <last><![CDATA[Johnson & ]]><![CDATA[ <<Johnson>> ]]><![CDATA[ & Doe ]]></last>
      </name>&#x000d;&#x0020;&#x000a;&#x0009;<age>
      42
      </age>
      <Married>false</Married>
      <City><![CDATA[Hanga <Roa>]]></City>
      <State>Easter Island</State>
      <!-- Need more details. --> what?
      <foo> more <bar/> text </foo>
  </person>
`

func main() {
    stripped, err := removeWS(xmlData)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Print(stripped)
}

func removeWS(s string) (string, error) {
    dec := xml.NewDecoder(strings.NewReader(s))

    var sb strings.Builder
    enc := NewSkipWSEncoder(&sb)

    for {
        tok, err := dec.Token()
        if err != nil {
            if err == io.EOF {
                break
            }
            return "", fmt.Errorf("failed to decode token: %w", err)
        }

        err = enc.EncodeToken(tok)
        if err != nil {
            return "", fmt.Errorf("failed to encode token: %w", err)
        }
    }

    err := enc.Flush()
    if err != nil {
        return "", fmt.Errorf("failed to flush encoder: %w", err)
    }

    return sb.String(), nil
}

type SkipWSEncoder struct {
    *xml.Encoder

    sawStartElement bool
    charData        []xml.CharData
}

func NewSkipWSEncoder(w io.Writer) *SkipWSEncoder {
    return &SkipWSEncoder{
        Encoder: xml.NewEncoder(w),
    }
}

func (swe *SkipWSEncoder) EncodeToken(tok xml.Token) error {
    if cd, isCData := tok.(xml.CharData); isCData {
        if len(swe.charData) > 0 || swe.sawStartElement {
            swe.charData = append(swe.charData, cd.Copy())
            return nil
        }
        if isWS(cd) {
            return nil
        }
        return swe.Encoder.EncodeToken(tok)
    }

    if len(swe.charData) > 0 {
        _, isEndElement := tok.(xml.EndElement)
        err := swe.flushSavedCharData(isEndElement)
        if err != nil {
            return err
        }
    }

    _, swe.sawStartElement = tok.(xml.StartElement)

    return swe.Encoder.EncodeToken(tok)
}

func (swe *SkipWSEncoder) Flush() error {
    if len(swe.charData) > 0 {
        return errors.New("attempt to flush encoder while having pending cdata")
    }
    return swe.Encoder.Flush()
}

func (swe *SkipWSEncoder) flushSavedCharData(mustKeep bool) error {
    if mustKeep || !allIsWS(swe.charData) {
        err := encodeCDataList(swe.Encoder, swe.charData)
        if err != nil {
            return err
        }
    }

    swe.charData = swe.charData[:0]

    return nil
}

func encodeCDataList(enc *xml.Encoder, cdataList []xml.CharData) error {
    for _, cd := range cdataList {
        err := enc.EncodeToken(cd)
        if err != nil {
            return err
        }
    }
    return nil
}

func isWS(b []byte) bool {
    for _, c := range b {
        switch c {
        case 0x20, 0x09, 0x0d, 0x0a:
            continue
        }
        return false
    }
    return true
}

func allIsWS(cdataList []xml.CharData) bool {
    for _, cd := range cdataList {
        if !isWS(cd) {
            return false
        }
    }
    return true
}

Playground.

I'm not sure it completely covers all possible weird cases but it should be a good start.

score 1 · Answer 2 · answered Aug 23 '20 at 12:28

Eureka,

First remove the indentation from the XML, then remove the newlines.

// Regex to remove indentation
m1 := regexp.MustCompile(`( *)<`)
newstr := m1.ReplaceAllString(xmlString, "<")

// Replace newline
newLineReplacer := strings.NewReplacer("\n", "", "\r\n", "")
xmlString = newLineReplacer.Replace(newstr)

Runnable example: https://play.golang.org/p/Orp2RyPbGP2

score 1 · Answer 3 · answered Aug 25 '20 at 08:10

You can simply remove the newline and tab characters as follows:

package main

import (
    "fmt"
    "strings"
)

func main() {
    var s = `<person id="13">
    <name>
        <first>John</first>
        <last>Doe</last>
    </name>
    <age>42</age>
    <Married>false</Married>
    <City>Hanga Roa</City>
    <State>Easter Island</State>
    <!-- Need more details. -->
</person>`
    // A single ReplaceAll call removes every occurrence, so no loop is needed.
    s = strings.ReplaceAll(s, "\n", "")
    s = strings.ReplaceAll(s, "\t", "")
    fmt.Println(s)
}

Result:

<person id="13"><name><first>John</first><last>Doe</last></name><age>42</age><Married>false</Married><City>Hanga Roa</City><State>Easter Island</State><!-- Need more details. --></person>

Mobile Dan · Accepted Answer · 2020-11-18T02:43:12.293

Remove Whitespace-Only Sequences Between XML Tags

func unformatXML(xmlString string) string {
    var unformatXMLRegEx = regexp.MustCompile(`>\s+<`)
    unformatBetweenTags := unformatXMLRegEx.ReplaceAllString(xmlString, "><") // remove whitespace between XML tags
    return strings.TrimSpace(unformatBetweenTags) // remove whitespace before and after XML
}

RegEx Explanation

\s - matches any whitespace including tab, newline, form feed, carriage return and space

+ - matches one or more of the preceding item (here, whitespace characters)

RegEx syntax reference: https://golang.org/pkg/regexp/syntax/
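
To see what this pattern does and does not touch, here is a small sketch under the same regular expression. The names `betweenTags` and `unformat` are chosen for this illustration only; the pattern and the `strings.TrimSpace` call mirror the function above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// betweenTags matches a '>' followed by whitespace only, followed by a '<'.
var betweenTags = regexp.MustCompile(`>\s+<`)

// unformat (illustrative name) drops whitespace-only runs between tags
// and trims whitespace around the whole document.
func unformat(s string) string {
	return strings.TrimSpace(betweenTags.ReplaceAllString(s, "><"))
}

func main() {
	// Whitespace next to real text (" hi ") is kept; tabs, CR and LF
	// between tags are removed.
	fmt.Printf("%q\n", unformat("<a>\n\t<b> hi </b>\r\n</a>"))

	// Caveat: an element whose whole content is whitespace is emptied too.
	fmt.Printf("%q\n", unformat("<sep> </sep>"))
}
```

The second call shows the trade-off: whitespace that forms the entire content of an element is treated as formatting, so this approach fits documents where such content never carries meaning.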

Example

package main

import (
    "fmt"
    "regexp"
    "strings"
)

func main() {
    var s = `    
<person id="13">
    <name>
        <first>John</first>
        <last>Doe</last>
    </name>
    <age>42</age>
    <Married>false</Married>
    <City>Hanga Roa</City>
    <State>Easter Island</State>
    <!-- Need more details. -->
</person>   `

    s = unformatXML(s)
    fmt.Printf("'%s'\n", s) // single quotes used to confirm no leading or trailing whitespace
}

func unformatXML(xmlString string) string {
    var unformatXMLRegEx = regexp.MustCompile(`>\s+<`)
    unformatBetweenTags := unformatXMLRegEx.ReplaceAllString(xmlString, "><") // remove whitespace between XML tags
    return strings.TrimSpace(unformatBetweenTags) // remove whitespace before and after XML
}

Runnable Example in Go Playground

https://play.golang.org/p/VS1LRNevicz