1

i have a xml that i want to parse into csv, like i started to work with awk, i would like to continue with it but i know it is possible to do it with over language like perl also i found xmlstarlet but i don't have permission to install on server so i'am open on over solutions. So my iinput xml is

<?xml version="1.0"?>
<root>
  <record>
   <id_client>50C</id_client>  
  <data>
          <mail>1@mail.com</mail>
          <adress>10  </adress>
          <num_tel>001</num_tel>
          <key>C</key>
      <contact>
        <name>toto</name>
        <birth>01/30/009</birth>
        <city>London</city>
      </contact>
  </data> 
  <data>
          <mail>2@gmaiil.com</mail>
          <adress>20</adress>
          <num_tel>02200</num_tel>
          <key>D1</key>
      <contact>
        <name>tata</name>
        <birth>02/08/2004</birth>
        <city>Bruges</city>
      </contact>
  </data> 
</record>
   <record>
   <id_client>70D</id_client>  
  <data>
          <mail>3@gmail.com</mail>
          <adress>7Bcd</adress>
          <num_tel>5555</num_tel>
          <key>D2</key>
      <contact>
        <name>titi</name>
        <birth>05/07/2014</birth>
        <city>Paris</city>
      </contact>
  </data>
  <data>
          <mail>4@gmail.com</mail>
          <adress>888</adress>
          <num_tel>881.0</num_tel>
          <key>D3</key>
      <contact>
        <name>awk</name>
        <birth>05/08/1999</birth>
        <city>Lisbone</city>
      </contact>
  </data>

I would like to output in an over file this csv with hearders

id_client;mail;num_tel;key 
50C;1@mail.com;001;C
50C,2@gmail.com;02200;D1
70D;3@gmail.com;5555;D2 
70D;4@gmail.com;881.0;D3
iceman225
  • 109
  • 1
  • 2
  • 9
  • That XML looks a bit broken. missing `` tags, what should be the closing `` tag is an opening ``, and the `` tag in the second record is only half there. Is the actual input data also broken, or is it just the example? – Wintermute Apr 14 '15 at 13:00
  • it my mistake i correct it, however the file is not broken when it is generated – iceman225 Apr 14 '15 at 13:04
  • Use an xml parser like xmllint. –  Apr 14 '15 at 13:07
  • 1
    Okay. Do you have access to any XML-processing tools such as xsltproc, xalan, xmllint, or xmlstarlet? You don't want to do this with awk or other plain-text tools. Can you install Perl modules from CPAN? Do you have access to Python, if all else fails? – Wintermute Apr 14 '15 at 13:09
  • 1
    I don't have permission to install anithing because it is customer servers just put my files and use it. you misread me i want to use awk because i am starting understand the syntax – iceman225 Apr 14 '15 at 13:18
  • Obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Wintermute Apr 14 '15 at 13:19
  • 1
    I think we do not misread you: awk is just a bad solution to the (xml) problem. XML data **requires** tools that really understand XML. – glenn jackman Apr 14 '15 at 13:21

5 Answers5

6

This answer is given in order to illustrate the text-based procedure to extract the info from the specific .xml formatting shown in the question description (the same .xml can be formatted differently -e.g. no line feeds- making the process described here unsuitable).

If possible, use a XML-specific tool as xmllint.

Text-based one liner:

cat input.xml | grep -e \<mail\> -e \<adress\> -e \<num_tel\> -e \<key\> | sed 's/<[^>]*>//g' | sed 's/^\s*//g; s/\s*$//g' | paste -d ";" - - - -

Explanation:

  1. Read input file (cat input.xml)
  2. Get the appropriate tags lines (with grep)
  3. Remove XML tags with, leaving only the tag contents (with sed)
  4. Trim spaces (with sed again; two expressions in a single sed command: one for the leading spaces and one for the traling spaces)
  5. Paste every 4 lines as columns (with paste)
  • hi, this is not taking the values properly if they have spaces. how so i ignore the spaces? – Anu Jul 29 '20 at 07:55
  • Hello Anu. I have updated the line to use sed instead of awk to trim the value. It is a bit longer statement, but provides the flexibility you want. – Ramón Gil Moreno Jul 30 '20 at 17:59
5

You're going to run into lots of problems parsing XML line-by-line: XML is not a line-oriented data format.

Use an XML-specific tool. Here's how simple it can be:

xmlstarlet sel -t \
  -m / -o "id_client;mail;num_tel;key" -n -b \
  -m /root/record/data -v ../id_client -o ";" -v mail -o ";" -v num_tel -o ";" -v key -n \
file.xml
id_client;mail;num_tel;key
50C;1@mail.com;001;C
50C;2@gmaiil.com;02200;D1
70D;3@gmail.com;5555;D2
70D;4@gmail.com;881.0;D3
glenn jackman
  • 238,783
  • 38
  • 220
  • 352
2

With Python, which has an XML parser in its standard library and a decent chance of being preinstalled on the server to which you have to deploy:

#!/usr/bin/python

import xml.etree.ElementTree as ET
import sys

tree = ET.parse(sys.argv[1])
root = tree.getroot()

print "id_client;mail;num_tel;key"

# Rudimentary error handling: If a field is not there,
# print (nil) in its stead.    
def xml_read(node, key):
    p = node.find(key)
    if p is None:
        return "(nil)"
    return p.text

for r in root.iter("record"):
    for d in r.iter("data"):
        print xml_read(r, "id_client") + ";" + xml_read(d, "mail") + ";" + xml_read(d, "num_tel") + ";" + xml_read(d, "key")

Alternatively, if you have access to an XSLT processor (although I dare not hope for this), you could use the following stylesheet:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/root">id_client;mail;num_tel;key
<xsl:for-each select="record">
  <xsl:for-each select="data"><xsl:value-of select="../id_client"/>;<xsl:value-of select="mail"/>;<xsl:value-of select="num_tel"/>;<xsl:value-of select="key"/><xsl:text>&#xa;</xsl:text></xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet> 

Use

xsltproc filename.xsl filename.xml

or

xalan -xsl filename.xsl -in filename.xml

where filename.xsl is the file that contains the above XSLT. If you have a different XSLT processor, it will work just as well; consult its manpage to see how it wants to be invoked.

Wintermute
  • 42,983
  • 5
  • 77
  • 80
1

You could try this:

awk 'BEGIN{ RS="record"; FS="[<>]" } { print $10 "," $14 "," $18 }' file

Which is not the most portable way to do it. Better would be:

awk -F'[<>]' '$2 == "mail" || $2 == "adress" { printf "%s\, ", $3 }; $2 == "num_tel" { print $3 }' a

That way you can add other lines without a problem, as long as you don't change the keys.

ShellFish
  • 4,351
  • 1
  • 20
  • 33
0
#!/usr/bin/perl
use XML::DT;

my %handler=(
  -default  => sub{ $c},                # $c - element contents
  -type     => { data => "MAP" },       # data suns became (tag => $c)

  id_client => sub{ father(id=>$c);},
  data      => sub{ print father("id"),";$c->{mail};$c->{num_tel};$c->{key}\n"},
);
dt(shift, %handler);
JJoao
  • 4,891
  • 1
  • 18
  • 20