How can I remove the entry tag and convert this XML into a pipe-delimited file?

<entry><company>ABC</company><appname>XYZ</appname><appid>12345678</appid><updated>2014-04-29T20:58:00-07:00</updated><msgid>923605123</msgid><title>Crash</title><content type="text">Whenever you try to use the graph function.  I expect better from Schwab</content><version>4.1.3.6</version><rating>1</rating></entry>

Expected output format:

ABC|XYZ|12345678|2014-04-29T20:58:00-07:00|923605123|Crash|Whenever you try to use the graph function.  I expect better from Schwab|4.1.3.6|1|
Jordan Running
user3347931
  • Look into http://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash – RedX May 28 '14 at 15:51
  • Will the elements in the XML always be in the same order? – Jordan Running May 28 '14 at 15:54
  • I'd suggest using Ruby. It's easiest. – konsolebox May 28 '14 at 16:09
  • The Right Way to do this is to use a real XML-aware parser -- otherwise, you simply *won't* cover the corner cases (which include things like comments, entity expansion, and a multitude of other details). This means **not** using pure native bash -- though a bash-aware tool such as XMLStarlet can do the job. – Charles Duffy May 29 '14 at 00:52
  • @konsolebox's suggestion of Ruby is fine. Python, similarly, has good XML processing libraries. Any XQuery or XPath engine you can run from shell could also be used from bash, and would work for this job as well. Any answer here that uses awk or sed **is simply wrong**, inasmuch as it won't be able to handle all (or even most) syntactically-equivalent formulations of your input. – Charles Duffy May 29 '14 at 00:54

4 Answers

Consider something akin to the following:

xmlstarlet sel -t -m '//entry' \
  -v ./company -o '|' \
  -v ./appname -o '|' \
  -v ./appid   -o '|' \
  -v ./content -n     \
  <test.xml

It would be possible to write a query that doesn't spell out each column in turn -- but listing them explicitly is the better approach, as it guarantees that column 3 of every line (in this case) is always appid, which you could not otherwise rely on.

Note that XMLStarlet, like many compliant parsers, requires a well-formed XML document -- meaning that the document being processed must have a single root element. If what you have is a file that contains a stream of documents (no root element in which the entries are contained), this can be faked; one ugly but functional way to do this follows:

xmlstarlet ... < <(echo "<root>"; cat test.xml; echo "</root>")
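
As a rough sketch of how the two pieces above could fit together: the function names wrap_in_root and entries_to_psv are made up for illustration, the column order is taken from the expected output in the question, and the trailing '|' is emitted only because that expected output ends with one. It assumes xmlstarlet is installed and that test.xml holds a stream of root-less entry elements.

```shell
#!/usr/bin/env bash
# Sketch only: wrap_in_root and entries_to_psv are illustrative names.

# Emit the given files wrapped in a synthetic <root> element, so that a
# stream of sibling <entry> elements becomes one well-formed document.
wrap_in_root() {
  printf '<root>\n'
  cat -- "$@"
  printf '</root>\n'
}

# Read XML on stdin and print one pipe-delimited line per <entry>,
# with the columns spelled out in a fixed order.
entries_to_psv() {
  xmlstarlet sel -t -m '//entry' \
    -v ./company -o '|' \
    -v ./appname -o '|' \
    -v ./appid   -o '|' \
    -v ./updated -o '|' \
    -v ./msgid   -o '|' \
    -v ./title   -o '|' \
    -v ./content -o '|' \
    -v ./version -o '|' \
    -v ./rating  -o '|' -n
}

# Usage (requires xmlstarlet):
#   wrap_in_root test.xml | entries_to_psv > out.psv
```

Piping through wrap_in_root avoids the bash-only process substitution, so the same two functions also work in a plain POSIX shell.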

Charles Duffy

With xidel:

xidel -s input.xml -e 'join(entry/*,"|")'
Reino
sed 's/<[^>]*>/|/g; s/||*/|/g; s/^|//' file1 > file2

Edited to collapse adjacent "||" pairs and to strip the leading "|" that the opening tag would otherwise leave behind.

Bruce K
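awk 'NF {printf "%s%s", (s++ ? "|" : ""), $0} END {print ""}' RS='<[^>]+>'
  • set the Record Separator (RS) to a regex that matches any tag, for example <entry>; gawk and mawk treat a multi-character RS as a regex, while POSIX leaves this unspecified
  • NF: only print "lines" that contain a field, AKA don't print the empty records between tags (testing NF rather than $1 keeps a value such as 0)
  • if on the second "line" or more, print a | first, otherwise just print the "line"; passing the text as a printf argument keeps a literal % in the data from being read as a format
  • END prints the closing newline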

Result

ABC|XYZ|12345678|2014-04-29T20:58:00-07:00|923605123|Crash|Whenever you try to use the graph function.  I expect better from Schwab|4.1.3.6|1
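
The one-liner above writes every record onto a single output line, so a file with several entries comes out as one long row. A possible variant, assuming each entry sits on its own input line as in the question, splits fields on tags instead of records. The function name tags_to_psv is made up for illustration. Like any regex approach, it still drops empty elements and does not decode entities such as &amp;, which is what the comments on the question warn about.

```shell
#!/usr/bin/env bash
# Sketch only: tags_to_psv is an illustrative name.
# Assumes one <entry>...</entry> per input line.

# Use the tag regex as the field separator (a regex FS is standard awk),
# then join the non-empty fields of each line with '|'.
tags_to_psv() {
  awk -F'<[^>]+>' '{
    out = ""
    for (i = 1; i <= NF; i++)
      if ($i != "")
        out = out (out == "" ? "" : "|") $i
    if (out != "")
      print out
  }' "$@"
}

# Usage:
#   tags_to_psv file1 > file2
```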
Nimantha
Zombo