I have many text files in XML format, like this:

<TITLE>title</TITLE>
<TEXT>text</TEXT>

I need to change the tags to something like this:

<field name="title">title</field>
<field name="text">text</field>

I am trying to write a small bash script that uses the sed command to change the tags.

sed 's/<TEXT>/<field name="text">/g'

I use this command for every tag, but these files contain more than 20 different tags, so I think there must be a more efficient way to do this task.

Thank you for any help.

EDIT: Added sample input and output.

Input

<?xml version="1.0" encoding="UTF-8"?>
<DOC>
    <DOCID>MF-20020103001</DOCID>
    <DATE>01/03/02</DATE>
    <TITLE>Example title</TITLE>
    <TEXT>Very long text...</TEXT>
</DOC>

Output

<?xml version="1.0" encoding="UTF-8"?>
<doc>
    <field name="docid">MF-20020103001</field>
    <field name="date">01/03/02</field>
    <field name="title">Example title</field>
    <field name="text">Very long text...</field>
</doc>
awarus
  • Please add sample input (valid XML) and your desired output for that sample input to your question. – Cyrus Dec 08 '18 at 17:18
  • It's always a bad idea to read XML, and even worse to modify it, using a non-XML aware tool like sed. Sooner or later you'll come across an XML file that does something perfectly legitimate, like including whitespace in the start or end tag, that your script isn't allowing for. – Michael Kay Dec 08 '18 at 22:59
  • Yes, I understand now that using a tool like sed wasn't the best approach to modify XML files. Eventually, I decided to use XML parser created for this kind of task, thanks to many pieces of advice here. – awarus Dec 09 '18 at 12:07
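
The whitespace pitfall mentioned in the comments can be reproduced in a few lines. This is only an illustrative sketch in Python (the asker mentions choosing a Python XML parser); the helper names `naive_rename` and `parsed_title` are made up for the example, and the literal string replacement stands in for the sed substitution from the question.

```python
import xml.etree.ElementTree as ET

# Both documents are well-formed XML; the second has whitespace inside its tags.
compact = "<DOC><TITLE>Example title</TITLE></DOC>"
spaced = "<DOC><TITLE >Example title</TITLE\n></DOC>"


def naive_rename(xml_text):
    """Literal substitution, similar in spirit to sed 's/<TITLE>/.../g'."""
    xml_text = xml_text.replace("<TITLE>", '<field name="title">')
    return xml_text.replace("</TITLE>", "</field>")


def parsed_title(xml_text):
    """Read the TITLE element with an XML parser instead of pattern matching."""
    return ET.fromstring(xml_text).find("TITLE").text


naive_compact = naive_rename(compact)  # tags are rewritten
naive_spaced = naive_rename(spaced)    # nothing matches, text is left unchanged

print(naive_compact)
print(naive_spaced)
print(parsed_title(spaced))
```

The textual replacement silently skips the second document, while the parser reads both the same way.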

4 Answers

Here's a reasonable answer, since it uses a tool meant for XML:

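#!/bin/bash

# Apply an inline XSLT stylesheet to the XML file given as the first argument.
function transform() {

  {
  # The quoted 'EOF' keeps the shell from expanding $uppercase and $lowercase.
  cat  <<-'EOF'
    <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >

    <!-- XSLT 1.0 has no lower-case() function, so translate() maps A-Z to a-z -->
    <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
    <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />

    <xsl:output method="xml" encoding="UTF-8"/>

    <!-- The root element DOC becomes doc -->
    <xsl:template match="/DOC">
    <doc> 
      <xsl:apply-templates  />
    </doc>
    </xsl:template>

    <!-- Every other element becomes <field name="lowercased element name"> -->
    <xsl:template match="*">
    <field> 
    <xsl:attribute name="name"><xsl:value-of select="translate(local-name(),$uppercase,$lowercase)"/></xsl:attribute>
    <xsl:apply-templates />
    </field>
    </xsl:template>

    </xsl:stylesheet>
EOF
  } |  xsltproc - "$1"

}

# Quote the argument so file names containing spaces still work.
transform "$1"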

Here's the output I get when I run your input:

<?xml version="1.0" encoding="UTF-8"?>
<doc>
    <field name="docid">MF-20020103001</field>
    <field name="date">01/03/02</field>
    <field name="title">Example title</field>
    <field name="text">Very long text...</field>
</doc>

EDIT: I changed the program above to transform uppercase element names to lowercase. Credit goes to Jon W from "How can I convert a string to upper- or lower-case with XSLT?"

Mark

With the usual advice that it's better to parse XML with an XML parser, if you can count on the structure given in the example:

$ awk 'BEGIN { FS = "<|>"; OFS = ""} NF > 3 { $0 = "    <field name=\"" tolower($2) "\">"$3"</field>" }1' file
<?xml version="1.0" encoding="UTF-8"?>
<DOC>
    <field name="docid">MF-20020103001</field>
    <field name="date">01/03/02</field>
    <field name="title">Example title</field>
    <field name="text">Very long text...</field>
</DOC>
jas

Here's an awful answer that relies purely on sed and needs refining:

sed -e "s/<\([^/>]*\)>/<field name='\1'>/g" -e "s/<\/\([^.]*\)>/<\/field>/" 

Here's the output given your input:

<field name='?xml version="1.0" encoding="UTF-8"?'>
<field name='DOC'>
    <field name='DOCID'>MF-20020103001</field>
    <field name='DATE'>01/03/02</field>
    <field name='TITLE'>Example title</field>
    <field name='TEXT'>Very long text...</field>
</field>

You can see the obvious problems with my answer:

  1. The ?xml declaration was rewritten as if it were an element.
  2. The <DOC> element was turned into a field instead of becoming <doc>.
  3. The attribute value was not lowercased.
  4. Any other element that carries attributes would probably be mangled (like ?xml above).

The first advice you got was the best: use an XML parser. If you want, you can go nuts with XSLT and write an XSL style sheet (.xsl) that defines the transformation.

Mark
  • Thank you for your answer. Now I can see that using sed for this task is too complicated for me. I decided to use XML parser in python, it's actually very simple and effective. – awarus Dec 09 '18 at 11:54
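
For reference, a parser-based version along the lines the asker describes might look like the sketch below. It is not the asker's actual script (which was never posted); it uses Python's standard `xml.etree.ElementTree`, the function name `to_fields` is hypothetical, and it assumes the flat structure from the sample input, where every field is a direct child of the root element.

```python
import xml.etree.ElementTree as ET


def to_fields(xml_text):
    """Lowercase the root tag and turn each direct child into <field name="...">."""
    root = ET.fromstring(xml_text)
    root.tag = root.tag.lower()
    for child in root:
        # Record the old tag name, lowercased, before renaming the element.
        child.set("name", child.tag.lower())
        child.tag = "field"
    return ET.tostring(root, encoding="unicode")


# Sample input from the question, without the XML declaration.
sample = """<DOC>
    <DOCID>MF-20020103001</DOCID>
    <DATE>01/03/02</DATE>
    <TITLE>Example title</TITLE>
    <TEXT>Very long text...</TEXT>
</DOC>"""

result = to_fields(sample)
print(result)
```

Nested elements would need a recursive walk (for example over `root.iter()`), and the XML declaration would have to be written back separately when saving the result to a file.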

You can use any XSLT 1.0 processor, such as xsltproc, to transform your input XML into the desired output XML.

This is one possible XSLT 1.0 file:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

    <xsl:template match="/DOC">
      <doc>
        <xsl:apply-templates select="node()|@*" />
      </doc>
    </xsl:template>

    <xsl:template match="*">
      <field name="{translate(local-name(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')}">
        <xsl:value-of select="text()" />
      </field>
    </xsl:template>

</xsl:stylesheet>

Its output is:

<?xml version="1.0" encoding="UTF-8"?>
<doc>
    <field name="docid">MF-20020103001</field>
    <field name="date">01/03/02</field>
    <field name="title">Example title</field>
    <field name="text">Very long text...</field>
</doc>

You can produce this with the XSLT processor xsltproc:

xsltproc input.xslt input.xml
zx485