I have this awk/sed command:

awk '{full=full$0}END{print full;}' initial.xml | sed 's|</Product>|</Product>\
|g' > final.xml

It breaks up an XML document containing a large number of tags so that, in the new file, the entire contents of each Product node sit on a single line.

I am trying to run it from Python using os.system and the subprocess module, but that wraps the entire contents of the file into one line.

Can anyone convert it into an equivalent Python script? Thanks!

Vansh Khurana
  • Why not use an XML parser instead? Take a look at the [ElementTree API](http://docs.python.org/2/library/xml.etree.elementtree.html). – Martijn Pieters Aug 30 '13 at 11:11
  • Your `awk` code looks like it's missing a `+` in between `full` and `$0` – Paul Evans Aug 30 '13 at 11:12
  • To add to what @MartijnPieters said, look at the [lxml library](http://lxml.de/). – Mike Vella Aug 30 '13 at 11:13
  • @MikeVella: Which is an external library requiring installation. To extract just text from tags, the stdlib `xml.etree` library is plenty. – Martijn Pieters Aug 30 '13 at 11:17
  • I have to process the file so that the contents of each Product tag are on one line; that way I can be sure that when I pass a line to a mapper, the mapper has the complete info for the product. I am using a mapper for the XML processing to distribute the jobs and make it run faster. The data is really huge. – Vansh Khurana Aug 30 '13 at 11:17
  • @PaulEvans: No, the Awk script is correct, but `tr -d '\n' – tripleee Aug 30 '13 at 11:19
  • @MartijnPieters lxml offers a lot of functionality; it's worth the OP knowing about it and making their own decision as to whether they may need that functionality in the future. – Mike Vella Aug 30 '13 at 12:38

1 Answer

Something like this?

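from __future__ import print_function  # Python 3 style print() on Python 2
import fileinput

# Drop each line's newline, then start a new line after every closing tag.
for line in fileinput.input('initial.xml'):
    print(line.rstrip('\n').replace('</Product>', '</Product>\n'), end='')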

I'm using the print function because the default print in Python 2.x will add a space or newline after each set of output. There are various other ways to work around that, some of which involve buffering your output before printing it.
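If you would rather write straight to a file than go through print, the same idea can be wrapped in a small function. This is only a sketch: the name `split_products` and its `closing_tag` parameter are made up for illustration, and it shares the limitation of the loop above that a closing tag which is itself broken across two input lines will not be matched.

```python
import io


def split_products(src, dst, closing_tag="</Product>"):
    """Copy src to dst, dropping the original line breaks and
    starting a new line after every closing tag.

    src and dst are any text file objects (hypothetical helper,
    not part of any library).
    """
    for line in src:
        # Remove the line terminator, then re-insert one after each closing tag.
        dst.write(line.rstrip("\r\n").replace(closing_tag, closing_tag + "\n"))


# Small in-memory demonstration; with real files you would pass
# open("initial.xml") and open("final.xml", "w") instead.
sample = io.StringIO(
    "<Products>\n<Product>\n<Name>A</Name>\n</Product>\n"
    "<Product>\n<Name>B</Name>\n</Product>\n</Products>\n"
)
result = io.StringIO()
split_products(sample, result)
print(result.getvalue())
```

Because each line is written as soon as it is read, memory use stays flat no matter how large initial.xml is.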

For the record, your problem could equally well be solved in just a simple Awk script.

awk '{ gsub(/<\/Product>/,"&\n"); printf "%s", $0 }' initial.xml

Printing output as it arrives without a trailing newline is going to be a lot more efficient than buffering the whole file and then printing it at the end, and of course, Awk has all the necessary facilities to do the substitution as well. (gsub is not available in all dialects of Awk, though.)
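As for why the original pipeline puts everything on one line when launched from Python: the question doesn't show the Python code, so this is only a guess, but one plausible cause is the backslash-newline in the sed expression. Inside an ordinary (non-raw) Python string literal, a backslash immediately followed by a newline is a line continuation and both characters disappear, so sed would end up replacing `</Product>` with itself. The variable names below are just for illustration.

```python
# Ordinary triple-quoted string: the backslash + newline pair is swallowed
# by Python before the shell or sed ever sees it.
eaten = '''sed 's|</Product>|</Product>\
|g' '''

# Raw triple-quoted string: the backslash and the newline both survive,
# which is what sed needs to insert a line break.
kept = r'''sed 's|</Product>|</Product>\
|g' '''

print(repr(eaten))
print(repr(kept))
```

If that is what happened, building the command from a raw string before handing it to subprocess with shell=True should restore the line breaks, though doing the work in Python or in a single Awk script avoids the quoting problem altogether.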

tripleee