I have an XML file like this:

<pr_id>01</pr_id>
    <uniprot>O11482</uniprot>
    <uniprot>O96642</uniprot>
    <uniprot>Q67845</uniprot>
    <column>
        <column_id>1</column_id>
        <column_start>300</column_start>
        <column_end>334</column_end>
        <old_new>old</old_new>
        <comment></comment>
    </column>
    <column>
        <column_id>2</column_id>
        <column_start>335</column_start>
        <column_end>337</column_end>
        <old_new>new</old_new>
        <comment></comment>
      <pr_id>02</pr_id>
         <uniprot>P4455</uniprot>
         <uniprot>89WER8</uniprot>
         <uniprot>Q12845</uniprot>
          <column>
        <column_id>1</column_id>
        <column_start>12</column_start>
        <column_end>34</column_end>
        <old_new>old</old_new>
        <comment></comment>
       </column>
        <column>
        <column_id>2</column_id>
        <column_start>35</column_start>
        <column_end>37</column_end>
        <old_new>old</old_new>
        <comment></comment>

I would like to get the output as follows.

pr_id   uniprot  old_start  old_end
01      O11482   300         334
02      P4455    12          34
02      P4455    35          37

What is an easy way to achieve this? This is my first time dealing with XML files. Any suggestions would be appreciated!

2 Answers

In GNU Awk version 4, you can use the four-argument form of split(), which also stores the matched separators (here, the tags) in an array:

gawk -f a.awk file.xml

where a.awk is:

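BEGIN { RS = "^$" }    # gawk idiom: slurp the whole file into one record
{
    # a[] gets the text between tags, sep[] the tags themselves;
    # sep[i] sits between a[i] and a[i+1], so as long as tags alternate
    # open/close, the opening tags land on odd i
    n = split($0, a, /<\/?(uniprot|pr_id|column_start|column_end|old_new)>/, sep)
    for (i = 1; i <= n-1; i += 2) {
        if (sep[i] == "<pr_id>") { pp = a[i+1]; up = 0 }               # new record
        if (sep[i] == "<uniprot>" && up == 0) { uu = a[i+1]; up = 1 }  # first uniprot only
        if (sep[i] == "<column_start>") ss = a[i+1]
        if (sep[i] == "<column_end>") ee = a[i+1]
        if (sep[i] == "<old_new>" && a[i+1] == "old") {
            # store one output row per "old" column
            p[++k] = pp
            u[k] = uu
            st[k] = ss
            e[k] = ee
        }
    }
}
END {
    fmt = "%5s%10s%10s%10s\n"
    printf fmt, "pr_id", "uniprot", "old_start", "old_end"
    for (i = 1; i <= k; i++)
        printf fmt, p[i], u[i], st[i], e[i]
}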

Output:

pr_id   uniprot old_start   old_end
   01    O11482       300       334
   02     P4455        12        34
   02     P4455        35        37
Håkon Hægland
  • Thanks for your answer, but I don't get the desired output. The output I got looks like this: pr_id uniprot old_start old_end 01 O11482. I use Ubuntu 12.04 and just installed gawk using the command sudo dpkg -i gawk_4.0.1+dfsg-2_amd64.deb. Please help me. – user3194459 Jan 15 '14 at 01:25
  • @user3194459 I am also using Ubuntu 12.04, but with GNU Awk version 4.1 (not version 4.0.1). Maybe you could try version 4.1 instead? – Håkon Hægland Jan 15 '14 at 06:10

It depends on the size of the XML, but why not use Python's minidom (xml.dom.minidom) for files up to, say, 30 MB, or SAX (xml.sax) if the file is larger than that?

Even Excel might do the trick, if you just need it once.

However, all of this depends on well-formed XML (drag the file into a browser, or verify it with an XML tool of some sort). The XML you posted seems a bit off: for example, some <column> elements are never closed.
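
As a rough illustration of the minidom route, here is a minimal sketch. It assumes a cleaned-up version of the posted data: the <root> wrapper, the shortened SAMPLE string, and the helper names text_of and old_columns are all made up for this example and do not come from the question. It keeps the first uniprot of each pr_id and emits one row per column marked "old".

```python
from xml.dom import minidom

# Hypothetical, cleaned-up version of the posted data: a <root> wrapper is
# added and every <column> is closed so that the document is well-formed.
SAMPLE = """<root>
<pr_id>01</pr_id>
<uniprot>O11482</uniprot>
<uniprot>O96642</uniprot>
<column>
  <column_start>300</column_start>
  <column_end>334</column_end>
  <old_new>old</old_new>
</column>
<column>
  <column_start>335</column_start>
  <column_end>337</column_end>
  <old_new>new</old_new>
</column>
<pr_id>02</pr_id>
<uniprot>P4455</uniprot>
<column>
  <column_start>12</column_start>
  <column_end>34</column_end>
  <old_new>old</old_new>
</column>
<column>
  <column_start>35</column_start>
  <column_end>37</column_end>
  <old_new>old</old_new>
</column>
</root>"""


def text_of(parent, tag):
    """Return the text of the first <tag> element inside parent ('' if empty)."""
    node = parent.getElementsByTagName(tag)[0]
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE).strip()


def old_columns(xml_text):
    """Collect (pr_id, first uniprot, start, end) for every 'old' column."""
    rows = []
    pr_id = uniprot = None
    root = minidom.parseString(xml_text).documentElement
    for node in root.childNodes:
        if node.nodeType != node.ELEMENT_NODE:
            continue  # skip whitespace text nodes between elements
        if node.tagName == "pr_id":
            pr_id, uniprot = node.firstChild.data.strip(), None
        elif node.tagName == "uniprot" and uniprot is None:
            uniprot = node.firstChild.data.strip()  # keep only the first one
        elif node.tagName == "column" and text_of(node, "old_new") == "old":
            rows.append((pr_id, uniprot,
                         text_of(node, "column_start"),
                         text_of(node, "column_end")))
    return rows


if __name__ == "__main__":
    print("pr_id\tuniprot\told_start\told_end")
    for row in old_columns(SAMPLE):
        print("\t".join(row))
```

If the input is not well-formed, minidom.parseString raises xml.parsers.expat.ExpatError, which also makes this a quick way to check the file before going any further.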

user3194532