Converting XML into a pipe-delimited file using bash

Question

How can I remove the entry tag and convert this XML into a pipe-delimited file?

<entry><company>ABC</company><appname>XYZ</appname><appid>12345678</appid><updated>2014-04-29T20:58:00-07:00</updated><msgid>923605123</msgid><title>Crash</title><content type="text">Whenever you try to use the graph function.  I expect better from Schwab</content><version>4.1.3.6</version><rating>1</rating></entry>

Expected output format:

ABC|XYZ|12345678|2014-04-29T20:58:00-07:00|923605123|Crash|Whenever you try to use the graph function.  I expect better from Schwab|4.1.3.6|1|

Look into http://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash — RedX, May 28 '14 at 15:51
The Right Way to do this is to use a real XML-aware parser -- otherwise, you simply *won't* cover the corner cases (which include things like comments, entity expansion, and a multitude of other details). This means **not** using pure native bash -- though a bash-aware tool such as XMLStarlet can do the job. — Charles Duffy, May 29 '14 at 00:52
@konsolebox's suggestion of Ruby is fine. Python, similarly, has good XML processing libraries. Any XQuery or XPath engine you can run from shell could also be used from bash, and would work for this job as well. Any answer here that uses awk or sed **is simply wrong**, inasmuch as it won't be able to handle all (or even most) syntactically-equivalent formulations of your input. — Charles Duffy, May 29 '14 at 00:54

Charles Duffy · Answer 1 · 2014-05-29T01:12:17.977

Consider something akin to the following:

xmlstarlet sel -t -m '//entry' \
  -v ./company -o '|' \
  -v ./appname -o '|' \
  -v ./appid   -o '|' \
  -v ./content -n     \
  <test.xml

It would be possible to write a query which didn't call for spelling out each column in turn -- but writing it out is the better approach, as it ensures that column 3 in every line (in this case) always means appid, a guarantee you would not otherwise have.
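
As a sketch of what spelling out every column could look like for the sample entry, the helper below builds the same kind of -v/-o argument list from a fixed column list. The function name entries_to_psv and the columns variable are illustrative and not part of XMLStarlet; the column names are taken from the sample in the question, and a '|' is emitted after every value, including the last, to match the expected output shown there. It assumes xmlstarlet is on PATH and that the input file has a single root element.

```shell
#!/bin/sh
# Column names, in output order, taken from the sample <entry> in the question.
columns='company appname appid updated msgid title content version rating'

# entries_to_psv FILE  (hypothetical helper name)
# Prints one pipe-delimited line per <entry> found in FILE.
entries_to_psv() {
  input=$1
  # Start the xmlstarlet argument list: match every <entry> element.
  set -- sel -t -m '//entry'
  for col in $columns; do
    # Emit the value of this child element, then a literal '|'.
    set -- "$@" -v "./$col" -o '|'
  done
  # End each entry with a newline.
  set -- "$@" -n
  xmlstarlet "$@" <"$input"
}
```

Usage would be along the lines of: entries_to_psv test.xml > out.psv. Because the column order lives in one place, an entry that lacks one of the elements should produce an empty column rather than shifting the later ones.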

Note that XMLStarlet, like many compliant parsers, requires a well-formed XML document -- meaning the document being processed must have a single root element. If what you have is a file that contains a stream of documents (no root element in which the entries are contained), this can be faked; one ugly but functional way to do this follows: xmlstarlet ... < <(echo "<root>"; cat test.xml; echo "</root>")
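
The same wrapping can also be written as a plain pipeline instead of process substitution, which works in shells other than bash too. A minimal sketch, where wrap_in_root is a hypothetical helper name and the input is assumed to carry no XML declaration or DOCTYPE of its own (either would end up inside the synthetic root and break well-formedness):

```shell
#!/bin/sh
# wrap_in_root [FILE...]  (hypothetical helper name)
# Emits the given files (or stdin when no file is given) inside a
# synthetic <root> element, so a stream of root-less <entry>
# documents becomes a single document with one root.
wrap_in_root() {
  printf '%s\n' '<root>'
  cat -- "$@"
  printf '%s\n' '</root>'
}

# Possible usage, assuming xmlstarlet is installed and test.xml exists:
#   wrap_in_root test.xml | xmlstarlet sel -t -m '//entry' -v ./company -n
```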

score 1 · Answer 2 · answered Nov 07 '21 at 19:41

With xidel:

xidel -s input.xml -e 'join(entry/*,"|")'

Reino

score 0 · Answer 3 · answered May 28 '14 at 15:52

sed 's/<[^>]*>/|/g;s/||*/|/g' file1 > file2

Edited to remove adjacent "||" pairs

Bruce K

score 0 · Accepted Answer · edited Nov 02 '21 at 09:18

awk '$1 {printf "%s%s", (s++ ? "|" : ""), $0}' RS='<[^>]+>'

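Set the record separator (RS) to a regex that matches any tag, for example <entry>. A regex RS is an extension supported by gawk and mawk, not something POSIX awk guarantees.
Only print records that contain a field, i.e. skip the empty records between adjacent tags.
From the second printed record on, print a | before the record; otherwise just print the record.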

Result

ABC|XYZ|12345678|2014-04-29T20:58:00-07:00|923605123|Crash|Whenever you try to use the graph function.  I expect better from Schwab|4.1.3.6|1
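
The one-liner above joins every record in the input into a single output line. For a file with one <entry> per line, a variant that treats tags as field separators gives one output line per entry. This is only a sketch under that one-entry-per-line assumption; entries_to_pipes is an illustrative name, and like any regex-based approach it does not handle comments, CDATA, entities, or empty elements (an empty element is dropped, which shifts the later columns).

```shell
#!/bin/sh
# entries_to_pipes [FILE...]  (hypothetical helper name)
# Assumes each input line holds exactly one complete <entry>...</entry>.
entries_to_pipes() {
  # -F takes a regex in POSIX awk, so every tag acts as a field separator.
  awk -F'<[^>]+>' '{
    out = ""; sep = ""
    for (i = 1; i <= NF; i++) {
      # Skip the empty fields that sit between adjacent tags.
      if ($i ~ /[^ \t\r]/) {
        out = out sep $i
        sep = "|"
      }
    }
    # Print only lines that contained at least one value.
    if (sep != "") print out
  }' "$@"
}
```

Possible usage: entries_to_pipes file1 > file2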

edited Nov 02 '21 at 09:18

Nimantha

answered May 29 '14 at 00:49

Zombo
