Awk parse xml to csv

Question

I have an XML file that I want to parse into CSV. I started working with awk and would like to continue with it, but I know it is also possible with other languages such as Perl. I also found xmlstarlet, but I don't have permission to install anything on the server, so I am open to other solutions. My input XML is:

<?xml version="1.0"?>
<root>
  <record>
   <id_client>50C</id_client>  
  <data>
          <mail>1@mail.com</mail>
          <adress>10  </adress>
          <num_tel>001</num_tel>
          <key>C</key>
      <contact>
        <name>toto</name>
        <birth>01/30/009</birth>
        <city>London</city>
      </contact>
  </data> 
  <data>
          <mail>2@gmaiil.com</mail>
          <adress>20</adress>
          <num_tel>02200</num_tel>
          <key>D1</key>
      <contact>
        <name>tata</name>
        <birth>02/08/2004</birth>
        <city>Bruges</city>
      </contact>
  </data> 
</record>
   <record>
   <id_client>70D</id_client>  
  <data>
          <mail>3@gmail.com</mail>
          <adress>7Bcd</adress>
          <num_tel>5555</num_tel>
          <key>D2</key>
      <contact>
        <name>titi</name>
        <birth>05/07/2014</birth>
        <city>Paris</city>
      </contact>
  </data>
  <data>
          <mail>4@gmail.com</mail>
          <adress>888</adress>
          <num_tel>881.0</num_tel>
          <key>D3</key>
      <contact>
        <name>awk</name>
        <birth>05/08/1999</birth>
        <city>Lisbone</city>
      </contact>
  </data>
  </record>
</root>

I would like to write this CSV, with headers, to another file:

id_client;mail;num_tel;key 
50C;1@mail.com;001;C
50C;2@gmail.com;02200;D1
70D;3@gmail.com;5555;D2 
70D;4@gmail.com;881.0;D3

That XML looks a bit broken. missing `` tags, what should be the closing `` tag is an opening ``, and the `` tag in the second record is only half there. Is the actual input data also broken, or is it just the example? — Wintermute, Apr 14 '15 at 13:00
It was my mistake; I have corrected it. However, the file is not broken when it is generated. – iceman225, Apr 14 '15 at 13:04
Okay. Do you have access to any XML-processing tools such as xsltproc, xalan, xmllint, or xmlstarlet? You don't want to do this with awk or other plain-text tools. Can you install Perl modules from CPAN? Do you have access to Python, if all else fails? — Wintermute, Apr 14 '15 at 13:09
I don't have permission to install anything because these are customer servers; I just put my files there and use them. You misread me: I want to use awk because I am starting to understand the syntax. – iceman225, Apr 14 '15 at 13:18
Obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Wintermute, Apr 14 '15 at 13:19
I think we do not misread you: awk is just a bad solution to the (xml) problem. XML data **requires** tools that really understand XML. — glenn jackman, Apr 14 '15 at 13:21

Ramón Gil Moreno · Answer 1 · 2020-07-30T17:58:00.673

6

This answer illustrates a text-based procedure for extracting the information from the specific XML formatting shown in the question. The same XML can be formatted differently (for example, without line feeds), which would make the process described here unsuitable.

If possible, use an XML-specific tool such as xmllint.

Text-based one liner:

cat input.xml | grep -e \<mail\> -e \<adress\> -e \<num_tel\> -e \<key\> | sed 's/<[^>]*>//g' | sed 's/^\s*//g; s/\s*$//g' | paste -d ";" - - - -

Explanation:

Read input file (cat input.xml)
Get the appropriate tags lines (with grep)
Remove the XML tags, leaving only the tag contents (with sed)
Trim spaces (with sed again; two expressions in a single sed command: one for the leading spaces and one for the trailing spaces)
Paste every 4 lines as columns (with paste)
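
The steps above can also be sketched in Python, which may help when following what each stage of the pipeline does. This is only an illustration of the same line-based idea, not a replacement for a real XML parser; the function name lines_to_rows and the per_row parameter are made up for this example, and it assumes one tag per line, exactly as in the question's sample.

```python
import re

# Tag names taken from the grep step of the one-liner above.
WANTED = ("<mail>", "<adress>", "<num_tel>", "<key>")

def lines_to_rows(lines, per_row=4):
    """Keep wanted lines, strip tags, trim, then join every `per_row` values."""
    values = [
        re.sub(r"<[^>]*>", "", line).strip()   # the two sed steps
        for line in lines
        if any(tag in line for tag in WANTED)  # the grep step
    ]
    # The paste step: group consecutive values into ';'-separated rows.
    return [
        ";".join(values[i:i + per_row])
        for i in range(0, len(values), per_row)
    ]

sample = [
    "  <data>",
    "          <mail>1@mail.com</mail>",
    "          <adress>10  </adress>",
    "          <num_tel>001</num_tel>",
    "          <key>C</key>",
    "  </data>",
]
print(lines_to_rows(sample))  # ['1@mail.com;10;001;C']
```

Unlike paste, this sketch does not pad an incomplete final group with empty fields; it simply emits a shorter row.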

edited Jul 30 '20 at 17:58

answered Apr 14 '15 at 13:29

Hi, this does not take the values properly if they contain spaces. How do I ignore the spaces? – Anu Jul 29 '20 at 07:55
Hello Anu. I have updated the line to use sed instead of awk to trim the value. It is a bit longer statement, but provides the flexibility you want. – Ramón Gil Moreno Jul 30 '20 at 17:59

glenn jackman · Accepted Answer · 2015-04-14T21:02:54.970

5

You're going to run into lots of problems parsing XML line-by-line: XML is not a line-oriented data format.

Use an XML-specific tool. Here's how simple it can be:

xmlstarlet sel -t \
  -m / -o "id_client;mail;num_tel;key" -n -b \
  -m /root/record/data -v ../id_client -o ";" -v mail -o ";" -v num_tel -o ";" -v key -n \
file.xml

id_client;mail;num_tel;key
50C;1@mail.com;001;C
50C;2@gmaiil.com;02200;D1
70D;3@gmail.com;5555;D2
70D;4@gmail.com;881.0;D3

edited Apr 14 '15 at 21:02

answered Apr 14 '15 at 13:30

Is it possible to create a file that runs the xmlstarlet command automatically? – iceman225 Apr 14 '15 at 14:09
Thank you, I am going to talk about xmlstarlet with my boss; it is exactly what we need to use! – iceman225 Apr 14 '15 at 15:13
You would simply wrap a shell script around it, just like you would do with an awk command. – glenn jackman Apr 14 '15 at 15:19

Wintermute · Answer 3 · 2015-04-14T15:15:25.363

With Python, which has an XML parser in its standard library and a decent chance of being preinstalled on the server to which you have to deploy:

#!/usr/bin/env python3

import sys
import xml.etree.ElementTree as ET

tree = ET.parse(sys.argv[1])
root = tree.getroot()

print("id_client;mail;num_tel;key")

# Rudimentary error handling: if a field is missing or empty,
# print (nil) in its stead.
def xml_read(node, key):
    p = node.find(key)
    if p is None or p.text is None:
        return "(nil)"
    return p.text

# One CSV row per <data> element, prefixed with its record's id_client.
for r in root.iter("record"):
    for d in r.iter("data"):
        print(";".join([xml_read(r, "id_client"), xml_read(d, "mail"),
                        xml_read(d, "num_tel"), xml_read(d, "key")]))

Alternatively, if you have access to an XSLT processor (although I dare not hope for this), you could use the following stylesheet:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/root">id_client;mail;num_tel;key
<xsl:for-each select="record">
  <xsl:for-each select="data"><xsl:value-of select="../id_client"/>;<xsl:value-of select="mail"/>;<xsl:value-of select="num_tel"/>;<xsl:value-of select="key"/><xsl:text>&#xa;</xsl:text></xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

Use

xsltproc filename.xsl filename.xml

or

xalan -xsl filename.xsl -in filename.xml

where filename.xsl is the file that contains the above XSLT. If you have a different XSLT processor, it will work just as well; consult its manpage to see how it wants to be invoked.

Best answer; use XSLT. Python is a great choice too. Ruby also has an XML parser in its standard library. — glenn jackman, Apr 14 '15 at 13:34

ShellFish · Answer 4 · 2015-04-14T13:18:06.290

1

You could try this:

awk 'BEGIN{ RS="record"; FS="[<>]" } { print $10 "," $14 "," $18 }' file

Which is not the most portable way to do it. Better would be:

awk -F'[<>]' '$2 == "mail" || $2 == "adress" { printf "%s, ", $3 }; $2 == "num_tel" { print $3 }' file

That way you can add other lines without a problem, as long as you don't change the keys.
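
To make the field-splitting idea concrete, here is a rough Python sketch of the same technique: splitting each line on < and > puts the tag name in the second field and its value in the third, just like $2 and $3 in awk. It goes one step further than the command above by remembering the last id_client and emitting the id_client;mail;num_tel;key row the question asks for. The function name rows_from_lines is hypothetical, and the sketch assumes one tag per line with key as the last wanted field of each data block, as in the question's sample; it is still a line-based hack, not XML parsing.

```python
import re

def rows_from_lines(lines):
    """Line-oriented sketch: yields 'id_client;mail;num_tel;key' rows."""
    client = None
    row = {}
    for line in lines:
        # Same split as awk -F'[<>]': awk's $2 is fields[1], $3 is fields[2].
        fields = re.split(r"[<>]", line)
        if len(fields) < 3:
            continue
        tag, value = fields[1], fields[2].strip()
        if tag == "id_client":
            client = value                 # remembered for the following rows
        elif tag in ("mail", "num_tel"):
            row[tag] = value
        elif tag == "key":                 # assumed to close each data block
            yield ";".join([client or "", row.get("mail", ""),
                            row.get("num_tel", ""), value])
            row = {}

sample = [
    "   <id_client>50C</id_client>  ",
    "  <data>",
    "          <mail>1@mail.com</mail>",
    "          <adress>10  </adress>",
    "          <num_tel>001</num_tel>",
    "          <key>C</key>",
    "  </data> ",
]
print(list(rows_from_lines(sample)))  # ['50C;1@mail.com;001;C']
```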

edited Apr 14 '15 at 13:18

answered Apr 14 '15 at 13:10

JJoao · Answer 5 · 2015-04-22T07:31:09.693

0

#!/usr/bin/perl
use XML::DT;

my %handler=(
  -default  => sub{ $c},                # $c - element contents
  -type     => { data => "MAP" },       # data's children become a hash (tag => $c)

  id_client => sub{ father(id=>$c);},
  data      => sub{ print father("id"),";$c->{mail};$c->{num_tel};$c->{key}\n"},
);
dt(shift, %handler);

edited Apr 22 '15 at 07:31

answered Apr 20 '15 at 15:41
