I want to process a very large XML file (> 3 gigabytes) with Python 3, but the problem is that the file is truncated, like this:

<country name="Liechtenstein">
    <rank>1</rank>
    <year>2008</year>
    <neighbor name="Austria" direction="E"/>
</country>
<country name="Singapore">
    <rank>4</rank>
    <year>2011</year>
    <neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
    <rank>68</rank>

The result I'm looking for is this:

<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
    <rank>1</rank>
    <year>2008</year>
    <neighbor name="Austria" direction="E"/>
</country>
<country name="Singapore">
    <rank>4</rank>
    <year>2011</year>
    <neighbor name="Malaysia" direction="N"/>
</country>
</data>

So I have to add the header (shown below) to the XML file:

<?xml version="1.0"?>
<data>

then delete the incomplete part (shown below) at the end of the XML file:

<country name="Panama">
    <rank>68</rank>

and finally add the closing tag (shown below) to the end of the XML file:

</data>

All of these steps must be done by a Python script.

Thanks for your help.

azedra

1 Answer

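Read successive lines into a buffer; print and empty the buffer each time you have completed another <country>...</country> entry. Whatever is left in the buffer at the end of the input is the incomplete entry, and it is simply never printed.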

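import fileinput

# print() appends its own newline, so these strings do not end in one.
print('<?xml version="1.0"?>\n<data>')
country = []
for line in fileinput.input():
    country.append(line)
    if '</country>' in line:
        # The entry is complete: emit it and start a new buffer.
        print(''.join(country), end='')
        country = []
# Anything still in the buffer is the truncated entry; it is dropped.
print('</data>')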

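Each line read from the input already ends with a newline, so I use end='' to keep print() from adding a spurious second one between entries. This keyword argument is Python 3 syntax; in Python 2, print is a statement, so you would need from __future__ import print_function or sys.stdout.write() instead.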
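If you would rather write to a named output file than to standard output, the same buffering idea can be wrapped in a function. This is only a sketch: the name repair_truncated_xml and its parameters are made up for illustration, and it assumes, as in the sample, that every closing </country> tag sits on its own line.

```python
def repair_truncated_xml(src, dst, close_tag="</country>", root="data"):
    """Copy complete records from src to dst, wrapped in a root element.

    src and dst are open text-mode file objects (hypothetical interface).
    Lines after the last close_tag belong to the truncated record and are
    dropped. Returns the number of complete records written.
    """
    dst.write('<?xml version="1.0"?>\n<%s>\n' % root)
    buffered = []  # lines of the record currently being read
    kept = 0
    for line in src:  # iterating a file reads one line at a time
        buffered.append(line)
        if close_tag in line:
            # Record is complete: flush it and start a new buffer.
            dst.writelines(buffered)
            buffered.clear()
            kept += 1
    # Whatever is still buffered here is the incomplete record; skip it.
    dst.write("</%s>\n" % root)
    return kept
```

You would call it with two files opened via open(), for example the input in "r" mode and the output in "w" mode. Only one record is held in memory at a time, so the 3 GB size should not matter.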

Personally, I would write this in Awk, which is quite effective when it comes to this kind of task.

awk 'BEGIN { print "<?xml version=\"1.0\"?>\n<data>" }
    { b = b (b ? ORS : "" ) $0 }
    /<\/country>/ { print b; b=""; }
    END { print "</data>" }' country.xml

The ternary expression (b ? ORS : "") adds a newline (ORS, the Output Record Separator) only when b isn't empty, i.e. it avoids adding a newline before the first line of each buffered entry.
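Whichever of the two scripts you use, it may be worth checking that the result is well-formed before processing it further. The following is a minimal sketch using xml.etree.ElementTree.iterparse from the standard library; the function name count_records is made up for illustration, and the tag name country is taken from the sample data.

```python
import xml.etree.ElementTree as ET

def count_records(source, tag="country"):
    """Stream-parse source and count the closed <tag> elements.

    source is a file name or a file object. ET.ParseError is raised
    if the document is not well-formed (for example, still truncated).
    """
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the record's children and text to limit memory use.
            # The root still keeps a reference to the emptied element.
            elem.clear()
    return count
```

Because iterparse reads the input in chunks, the file is never loaded as a whole, though for millions of records the emptied elements attached to the root still add up.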

tripleee